JavaScript regexps demystified


Regexps are a pain, but they are extremely powerful. They have many uses, from validation to text recognition and file parsing. Lately I’ve been working with JavaScript a lot, and of course I ended up using regexps to validate user input in ExtJs.

One very good resource for finding a regexp, or a base for your own, is RegexLib, which hosts a huge collection of regexps. 90% of the time I have to write a regexp, I start by browsing that site for ideas or patterns to use. From that base I then build my own regexp, as in this case:


portsList = /^$|^(6[0-5]?[0-9]{0,2}[0-5]?|6[0-4]?[0-9]{0,3}|[1-5][0-9]{0,4}|[1-9][0-9]{0,3})(-(6[0-5]?[0-9]{0,2}[0-5]?|6[0-4]?[0-9]{0,3}|[1-5][0-9]{0,4}|[1-9][0-9]{0,3}))?((\s*,\s*)(6[0-5]?[0-9]{0,2}[0-5]?|6[0-4]?[0-9]{0,3}|[1-5][0-9]{0,4}|[1-9][0-9]{0,3})(-(6[0-5]?[0-9]{0,2}[0-5]?|6[0-4]?[0-9]{0,3}|[1-5][0-9]{0,4}|[1-9][0-9]{0,3}))?)*$/;

This *huge and hard to maintain* regexp recognizes a comma-separated list of valid TCP/UDP ports or port ranges, where a port is a number between 0 and 65535 and a port range is written port-port (there is an error in that regexp, have fun finding it 😉). Needless to say, that code is unmaintainable, which is why I usually try to describe regexps in an EBNF-like way. In EBNF you split your regexp into smaller regexps stored in variables so that you can reuse them. For example, we can define:

  • port = […….]
  • portList = port || port-port

This is usually easy in programming languages (like Python with its re module), but I found it a little trickier in JavaScript. So here are my rules:

  1. Always use the RegExp object
  2. Always compose regexp from strings
  3. Always wrap each regexp string in parentheses.
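Rule 3 is the least obvious of the three. The reason is that `|` has the lowest precedence in a regexp, so anything concatenated around a bare alternation attaches only to its first or last alternative. A minimal sketch of the problem, using a made-up fragment (`animal`, `loose` and `grouped` are illustrative names only):

```javascript
// Hypothetical fragment: an alternation stored as a plain string.
var animal = "cat|dog";

// Without parentheses the pattern is ^cat|dog$,
// i.e. "starts with cat" OR "ends with dog".
var loose = new RegExp("^" + animal + "$");

// With parentheses the anchors apply to the whole alternation.
var grouped = new RegExp("^(" + animal + ")$");

console.log(loose.test("catfish"));   // true
console.log(loose.test("hotdog"));    // true
console.log(grouped.test("catfish")); // false
console.log(grouped.test("dog"));     // true
```

If the extra capture groups get in the way, a non-capturing group `(?:...)` gives the same isolation.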

In JavaScript you can create a regexp using the RegExp object or using the / / literal syntax. The advantage of the former is that you can compose the regexp from strings. Look at this example:


var my_regexp = new RegExp("^hello world$");

var my_regexp = /^hello world$/;

The two statements are equivalent. The only difference is that the former is constructed from a string, so we can put strings into variables and compose those variables into a valid regexp string.
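One caveat when moving from a literal to a string: backslashes have to be doubled, because the string parser consumes one level of escaping before RegExp ever sees the pattern. A small sketch of the difference (the variable names are only illustrative):

```javascript
// The same pattern written as a literal and as a string.
var fromLiteral = /^\d+$/;
var fromString = new RegExp("^\\d+$"); // "\\d" in the string becomes \d in the pattern

// With a single backslash the string parser drops it,
// so the pattern actually compiled is ^d+$.
var broken = new RegExp("^\d+$");

console.log(fromLiteral.source === fromString.source); // true
console.log(fromString.test("8080"));                  // true
console.log(broken.test("8080"));                      // false
console.log(broken.test("ddd"));                       // true
```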

Wrapping each regexp in parentheses may look verbose, but it makes the regexp less error-prone and easier to debug. Applying all the rules, you can do something like:


var port = "(6553[0-5]|655[0-2]\\d|65[0-4]\\d\\d|6[0-4]\\d{3}|[1-5]\\d{4}|[1-9]\\d{0,3}|0)";

var port_or_port_range = "("+port+"(-"+port+")?)";

var my_regexp = new RegExp("^("+port_or_port_range+")((\\s*,\\s*)("+port_or_port_range+"))*$");

With these three lines we accomplish the same job as the first long regexp I posted, in a form that is much easier to maintain and use. Note the leading ( and trailing ) around the port regexp string, and that my_regexp uses the RegExp object.
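A quick way to gain confidence in the composed regexp is to run it against a few inputs. The sketch below repeats the three definitions so it can be pasted into a console as-is; the `samples` array and its contents are my own test inputs. One difference worth noting: the first long regexp begins with `^$|` and so also accepts an empty string, while this version requires at least one port.

```javascript
var port = "(6553[0-5]|655[0-2]\\d|65[0-4]\\d\\d|6[0-4]\\d{3}|[1-5]\\d{4}|[1-9]\\d{0,3}|0)";
var port_or_port_range = "(" + port + "(-" + port + ")?)";
var my_regexp = new RegExp("^(" + port_or_port_range + ")((\\s*,\\s*)(" + port_or_port_range + "))*$");

// Illustrative inputs: the first five should pass, the last four should fail.
var samples = [
    "80", "0", "65535", "1024-2048", "80, 443 ,8000-8080",
    "65536", "080", "80,", ""
];

for (var i = 0; i < samples.length; i++) {
    console.log(JSON.stringify(samples[i]) + " -> " + my_regexp.test(samples[i]));
}
```

Keep in mind that the regexp only checks syntax: a reversed range such as 2048-1024 still passes, so ordering has to be checked separately if it matters.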
