Regular expressions are mandatory

I admit that regular expressions (regexes) are intimidating. I avoided them like a disease for six years. I wasn't quite sure what they did, but seeing stuff like this was enough to make me lose interest quick:
Whenever I did string parsing, I usually ended up with masses of loops and indexOf calls. I tried to avoid anything complicated, but the parsing code inevitably ended up huge, fragile, and difficult to read. During those six years, I probably wrote over ten thousand lines of string parsing code that could have been replaced with ten or twenty lines of regular expressions. Really.

If you do any kind of programming, or even just work with text, you need to understand regular expressions. They're very, very useful. So useful, in fact, that it's nearly impossible to find a language that doesn't support them. Perl, PHP, Ruby, and JavaScript even add syntactic sugar to make them easier to use.
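To give a feel for the difference between the loops-and-indexOf style and a regex, here's a minimal sketch in Python (any language with regex support would do). The key=value input format and both function names are invented for this illustration, and the regex version assumes keys are made of word characters and values contain no spaces or semicolons.

```python
import re


def parse_pairs_manual(text):
    """Extract 'key=value' pairs separated by ';' using only find() and slicing."""
    pairs = {}
    pos = 0
    while pos < len(text):
        # Locate the end of the current 'key=value' chunk.
        end = text.find(";", pos)
        if end == -1:
            end = len(text)
        # Locate the '=' inside this chunk, if there is one.
        eq = text.find("=", pos, end)
        if eq != -1:
            key = text[pos:eq].strip()
            value = text[eq + 1:end].strip()
            if key:
                pairs[key] = value
        pos = end + 1
    return pairs


# Hypothetical pattern: a word, optional spaces around '=', then a run of
# characters that are neither whitespace nor ';'.
PAIR = re.compile(r"(\w+)\s*=\s*([^;\s]+)")


def parse_pairs_regex(text):
    """Same job, with the shape of a pair stated in a single pattern."""
    return dict(PAIR.findall(text))
```

On an input like "name=Ann; age=42 ; city = Oslo", both functions return the same dictionary. The regex version states the shape of a pair in one line instead of tracking positions by hand.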

Regex patterns

Regex patterns don't have to be complicated to be useful. You'll probably get the most mileage out of simple patterns: quick regexes that save you a few dozen lines of code. For example, you might scan text for simple e-mail addresses with this pattern:

    [a-z]+@[a-z]+\.[a-z]+

Within square brackets, you can list allowed characters like this: [abcdefghijklm], or specify a range: [a-m]. The + symbol means 'one or more' of whatever precedes it, so A+ would match A or AAA or AAAAA. The asterisk (*) means 'zero or more'. The period (.) means 'any character'. Since we're actually trying to match a period, we have to 'escape' it with a backslash: \.

Unfortunately, the above regex doesn't match e-mail addresses that contain digits, periods, underscores, or hyphens. Here's an improved version:

    [a-z0-9._-]+@[a-z0-9.-]+\.[a-z]+

And here's an even more sophisticated version. It excludes domains that start or end in a hyphen. It also prevents more than one period in a row in the domain or username:

    [a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
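Here's a rough sketch of how those pieces behave, written in Python purely as an illustration. The three pattern strings are my cleaned-up reading of the ones above (stray spaces removed, the doubled backslashes read as a single escape, and a trailing + assumed on the top-level domain so the whole ending is captured). The sample text, the example.com addresses, and the helper names are all made up.

```python
import re

# Building blocks: character classes, +, *, the period, and escaping.
assert re.fullmatch(r"[a-m]+", "black")        # every letter is in the range a-m
assert not re.fullmatch(r"[a-m]+", "zebra")    # z and r fall outside a-m
assert re.fullmatch(r"A+", "AAAAA")            # one or more A
assert not re.fullmatch(r"A+", "")             # + needs at least one
assert re.fullmatch(r"A*", "")                 # * accepts zero
assert re.fullmatch(r"a.c", "abc")             # unescaped period: any character
assert not re.fullmatch(r"a\.c", "abc")        # escaped period: only a literal '.'

# The three e-mail patterns, lowercase only, as in the text.
SIMPLE = re.compile(r"[a-z]+@[a-z]+\.[a-z]+")
IMPROVED = re.compile(r"[a-z0-9._-]+@[a-z0-9.-]+\.[a-z]+")
STRICT = re.compile(
    r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
    r"@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
)

# Hypothetical sample text for scanning.
TEXT = "write to bob@example.com or jane.doe99@mail-server.example.org today"


def find_emails(pattern, text):
    """Scan text and return every substring the pattern matches."""
    return pattern.findall(text)


def is_email(pattern, candidate):
    """Return True only if the whole candidate string matches the pattern."""
    return pattern.fullmatch(candidate) is not None


print(find_emails(SIMPLE, TEXT))    # only the letters-only address
print(find_emails(IMPROVED, TEXT))  # both addresses
print(is_email(IMPROVED, "a..b@example.com"))  # accepted by the looser pattern
print(is_email(STRICT, "a..b@example.com"))    # rejected: two periods in a row
```

The simple pattern finds only bob@example.com, because the second address has digits, a period, and a hyphen. The improved pattern finds both, but it also accepts strings like a..b@example.com or bob@-example.com, which the stricter pattern rejects.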




About Nathanael

Nathanael Jones is a software engineer, father, consultant, and computer linguist with unreasonably high expectations of inanimate objects. He refines .NET, Ruby, and JavaScript libraries full-time at Imazen, but can often be found on Stack Overflow or participating in W3C community groups.




I run Imazen, a tiny software company that specializes in web-based image processing and other difficult engineering problems. I spend most of my time writing image-processing code in C#, web apps in Ruby, and documentation in Markdown.
