I admit that regular expressions (regexes) are intimidating. I avoided them like a disease for six years. I wasn't quite sure what they did, but seeing stuff like this was enough to make me lose interest quickly:
<(?<tagname>[\\w:-]+)(?<attrpair>\\s+(?<attrname>\\w[-\\w:]*)
(\\s*=\\s*\"(?<attrval>[^\"]*)\"|
\\s*=\\s*'(?<attrval>[^']*)'|
\\s*=\\s*(?<attrval><%#.*?%>)|
\\s*=\\s*(?<attrval>[^\\s=/>]*)|
(?<attrval>\\s*?)))*
\\s*(?<insertattrs>)(?:(?<empty>/>)|>
(?s:(?<contents>.*?)
(?i:</\\s*\\k<tagname>\\s*>)))
Whenever I did string parsing, I usually ended up with masses of loops and indexOf calls. I tried to avoid anything complicated, but the parsing code inevitably ended up huge, fragile, and difficult to read.
During those six years, I probably wrote over ten thousand lines of string parsing code that could have been replaced with ten or twenty lines of regular expressions. Really.
If you do any kind of programming, or even just work with text, you need to understand regular expressions. They're very, very useful. So useful, in fact, that it's nearly impossible to find a language that doesn't support them. Perl, PHP, Ruby and JavaScript even add syntactic sugar to make them easier to use.
Regex patterns
Regex patterns don't have to be complicated to be useful. You'll probably get the most mileage out of simple patterns - quick regexes that save you a few dozen lines of code.
For example, you might scan text for simple e-mail addresses with this pattern:
[a-z]+@[a-z]+\.[a-z]+
Within square brackets, you can list allowed characters like this: [abcdefghijklm] or specify a range: [a-m].
The + symbol means 'one or more' of whatever precedes it. So "A+" would match A or AAA or AAAAA. The asterisk (*) means 'zero or more'.
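Here is a minimal sketch of character classes and the two quantifiers in Python. The variable names are made up for illustration, and fullmatch is used so that each pattern has to cover the whole test string.

```python
import re

# A character class with a range: any single letter from a to m.
first_half = re.compile(r"[a-m]")

# "+" needs at least one occurrence; "*" also accepts none at all.
one_or_more = re.compile(r"A+")
zero_or_more = re.compile(r"A*")

in_range = first_half.fullmatch("c") is not None
out_of_range = first_half.fullmatch("z") is not None
plus_on_run = one_or_more.fullmatch("AAAAA") is not None
plus_on_empty = one_or_more.fullmatch("") is not None
star_on_empty = zero_or_more.fullmatch("") is not None

print(in_range, out_of_range)        # True False
print(plus_on_run, plus_on_empty)    # True False
print(star_on_empty)                 # True
```

The only difference between the two quantifiers shows up on the empty string: A* accepts it, A+ does not.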
The period (.) means 'any character' (in most regex flavors, any character except a newline). Since we're actually trying to match a period, we have to 'escape' it with a backslash: \.
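The effect of the backslash is easy to check. In this small Python sketch the three sample strings are made up for illustration; the unescaped pattern accepts all of them, while the escaped one accepts only the string that really contains a period.

```python
import re

samples = ["a.c", "abc", "a-c"]

# Unescaped: "." stands for any single character.
any_char = [s for s in samples if re.fullmatch(r"a.c", s)]

# Escaped: "\." matches only a literal period.
literal_dot = [s for s in samples if re.fullmatch(r"a\.c", s)]

print(any_char)     # ['a.c', 'abc', 'a-c']
print(literal_dot)  # ['a.c']
```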
Unfortunately, the above regex doesn't match e-mail addresses like
[email protected]. It also won't match
[email protected] or
[email protected]. Here's an improved version.
[a-z0-9._-]+@[a-z0-9.-]+\.[a-z]+
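To see the difference between the two patterns on actual input, here is a small Python sketch. The sample sentence and both addresses in it are made up for illustration, and each pattern is written with a trailing + so that the whole top-level domain is captured rather than only its first letter.

```python
import re

# The first, letters-only pattern and the improved one.
SIMPLE = re.compile(r"[a-z]+@[a-z]+\.[a-z]+")
IMPROVED = re.compile(r"[a-z0-9._-]+@[a-z0-9.-]+\.[a-z]+")

# Hypothetical input text; the addresses are invented examples.
text = ("Write to john.smith99@mail.example.com "
        "or to jane_doe@my-site.example.org today.")

simple_found = SIMPLE.findall(text)
improved_found = IMPROVED.findall(text)

print(simple_found)
# []
print(improved_found)
# ['john.smith99@mail.example.com', 'jane_doe@my-site.example.org']
```

The letters-only pattern finds nothing here: the first address has a digit right before the @, and the second has a hyphen in its domain. Both patterns match lowercase only; passing re.IGNORECASE to re.compile would be one way to lift that restriction.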
And here's an even more sophisticated version. It excludes domains that start or end with a hyphen. It also prevents more than one period in a row in the domain or username.
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
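The two rules can be checked with a short Python sketch. The helper name is_valid and the test addresses are made up for illustration. The pattern sits in a raw string, so each period is escaped with a single backslash, and fullmatch requires the entire string to fit.

```python
import re

# Username: runs of allowed characters separated by single periods.
LOCAL = r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
# Domain label: starts and ends with a letter or digit, hyphens only inside.
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"

EMAIL = re.compile(LOCAL + r"@(?:" + LABEL + r"\.)+" + LABEL)

def is_valid(address: str) -> bool:
    """Return True if the whole string fits the pattern (lowercase only)."""
    return EMAIL.fullmatch(address) is not None

print(is_valid("first.last@example.com"))    # True
print(is_valid("user@my-site.example.com"))  # True
print(is_valid("first..last@example.com"))   # False: two periods in a row
print(is_valid("user@example..com"))         # False: two periods in a row
print(is_valid("user@-example.com"))         # False: label starts with a hyphen
print(is_valid("user@example-.com"))         # False: label ends with a hyphen
```

Splitting the pattern into LOCAL and LABEL is only a readability choice; concatenated, the pieces form the same single expression.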
Experiment
http://rubular.com/
Continued
http://en.wikipedia.org/wiki/Regular_expression
http://www.regular-expressions.info/