Regular Expression Lookarounds

(This post deals with creating a single regular expression that checks strings against a valid syntax but matches only a specific subset of those valid strings.)

I've been working with regular expressions on my current project.  I've used them in the past, but my current problem had me stumped for a while.  I've got a list of file names that I'm checking for validity against our document control naming scheme.  My current task is to look for any file names that have lowercase letters but are otherwise valid.  These will then be flagged for review and automatically converted to uppercase.

My first step was to create a regular expression that matched valid file names:
^[0-9]{3}-[0-9]{4}-[0-9A-Za-z][0-9]{2}-[0-9a-zA-Z]{2}-[0-9A-Za-z]{1,3}

In a more human-readable form, this means that the file names must be of the form: ###-####-X##-XX-Y where # is any digit, X is any alphanumeric character, and Y is a section of alphanumeric characters of variable length between 1 and 3 characters.  Example valid name: 123-4567-A89-B0-11
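As a quick sanity check, here is a minimal Python sketch of that first expression in action. Python is used only for illustration, and the helper name is_valid_name is my own invention, not part of the project.

```python
import re

# The first expression from above, with explicit character classes.
VALID_NAME = re.compile(
    r"^[0-9]{3}-[0-9]{4}-[0-9A-Za-z][0-9]{2}-[0-9a-zA-Z]{2}-[0-9A-Za-z]{1,3}"
)

def is_valid_name(name: str) -> bool:
    """Return True if the name starts with the ###-####-X##-XX-Y form."""
    return VALID_NAME.match(name) is not None

print(is_valid_name("123-4567-A89-B0-11"))  # True
print(is_valid_name("123-4567-a89-b0-1"))   # True (lowercase is still "valid" here)
print(is_valid_name("12-4567-A89-B0-11"))   # False (first section has only two digits)
```

Because the expression has no end anchor, anything that follows the final section (a file extension, for instance) is not checked.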

This regular expression worked out exactly as intended, but as part of this project I also learned some shorthand which simplifies the above expression and makes it more readable:
^\d{3}-\d{4}-[[:alnum:]]\d{2}-[[:alnum:]]{2}-[[:alnum:]]{1,3}

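This takes advantage of two shorthand elements:  \d is equivalent to [0-9] (some engines, .NET among them, also let it match other Unicode digits unless told otherwise) and [[:alnum:]] is equivalent to [0-9A-Za-z].  Unfortunately, C# doesn't recognize the second of those two shortcuts, because the .NET regex engine doesn't support POSIX bracket classes, so the explicit [0-9A-Za-z] is needed there.  Gordon McKinney has a good reference sheet for regular expressions.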
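To see that the shorthand really behaves the same as the long form, here is a small Python sketch that compares the two. Python's re module doesn't support [[:alnum:]] either, so the sketch keeps [0-9A-Za-z] and only swaps in \d. The names LONG_FORM, SHORT_FORM and agree are made up for this example.

```python
import re

LONG_FORM = re.compile(
    r"^[0-9]{3}-[0-9]{4}-[0-9A-Za-z][0-9]{2}-[0-9a-zA-Z]{2}-[0-9A-Za-z]{1,3}"
)

# re.ASCII restricts \d to [0-9]; without it, \d in Python also matches
# other Unicode decimal digits.
SHORT_FORM = re.compile(
    r"^\d{3}-\d{4}-[0-9A-Za-z]\d{2}-[0-9A-Za-z]{2}-[0-9A-Za-z]{1,3}",
    re.ASCII,
)

def agree(name: str) -> bool:
    """Return True if both expressions give the same verdict for the name."""
    return bool(LONG_FORM.match(name)) == bool(SHORT_FORM.match(name))

samples = ["123-4567-A89-B0-11", "123-4567-a89-b0-1", "12-4567-A89-B0-11", "abc"]
print(all(agree(name) for name in samples))  # True
```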

The next step was to look for file names that were matched by the previous regular expression but have a lowercase letter in them.  The easy way to do this would be to save the list of matches from the first expression and then match against the following expression to look for any lowercase letters:
[a-z]
or
[[:lower:]]
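
For reference, the two-step approach could look something like this in Python. This is only a sketch: the function name names_needing_review is hypothetical, and [a-z] stands in for [[:lower:]] because Python's re module has no POSIX classes.

```python
import re

VALID_NAME = re.compile(
    r"^\d{3}-\d{4}-[0-9A-Za-z]\d{2}-[0-9A-Za-z]{2}-[0-9A-Za-z]{1,3}",
    re.ASCII,
)
HAS_LOWER = re.compile(r"[a-z]")

def names_needing_review(names):
    """Return the names that are valid but contain a lowercase letter."""
    flagged = []
    for name in names:
        match = VALID_NAME.match(name)
        # Step 1: keep only valid names. Step 2: search the matched text for lowercase.
        if match and HAS_LOWER.search(match.group(0)):
            flagged.append(name)
    return flagged

names = ["123-4567-A89-B0-11", "123-4567-a89-B0-11", "123-4567-A89-b0-1c", "bad-name"]
print(names_needing_review(names))
# ['123-4567-a89-B0-11', '123-4567-A89-b0-1c']
```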

Simple, right?  Except that my requirements were to accomplish this all in one regular expression.  I experimented with a few things before finally stumbling on lookarounds.  I never could get my solution working with a positive lookahead (I think the problem was with the starting anchor), but I did get it working with a positive lookbehind:
^\d{3}-\d{4}-[[:alnum:]]\d{2}-[[:alnum:]]{2}-[[:alnum:]]{1,3}(?<=[[:lower:]])

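This expression first matches the valid naming scheme described above, and then the lookbehind checks the character immediately before the point where the match ends against [[:lower:]].  Note that a lookbehind of this form inspects only that single character, not the whole name, so the expression as written flags names with a lowercase letter in the final section (for example 123-4567-A89-B0-1a) but not names whose only lowercase letters come earlier (for example 123-4567-a89-B0-11).  To flag a lowercase letter anywhere in the name, the check has to cover the whole name.  One way to do that, assuming the string holds nothing but the file name, is a positive lookahead placed right after the starting anchor:
^(?=.*[[:lower:]])\d{3}-\d{4}-[[:alnum:]]\d{2}-[[:alnum:]]{2}-[[:alnum:]]{1,3}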
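
To check how the placement of the lookaround changes the result, here is a small Python sketch comparing a lookbehind at the end with a lookahead at the start. It is an illustration under a few assumptions: [a-z] and [0-9A-Za-z] replace the POSIX classes (Python's re module doesn't have them), and the names LOOKBEHIND_AT_END and LOOKAHEAD_AT_START are my own labels. The lookahead scans with [0-9A-Z-]* rather than a plain .* so that it stops at the first character that cannot be part of a name, such as the dot before an extension.

```python
import re

NAME = r"\d{3}-\d{4}-[0-9A-Za-z]\d{2}-[0-9A-Za-z]{2}-[0-9A-Za-z]{1,3}"

# Lookbehind at the end: tests only the character just before the match end.
LOOKBEHIND_AT_END = re.compile(r"^" + NAME + r"(?<=[a-z])", re.ASCII)

# Lookahead after the anchor: scans forward over digits, uppercase letters
# and hyphens until it reaches a lowercase letter somewhere in the name.
LOOKAHEAD_AT_START = re.compile(r"^(?=[0-9A-Z-]*[a-z])" + NAME, re.ASCII)

for name in ("123-4567-A89-B0-11", "123-4567-A89-B0-1a", "123-4567-a89-B0-11"):
    print(name, bool(LOOKBEHIND_AT_END.match(name)), bool(LOOKAHEAD_AT_START.match(name)))
# 123-4567-A89-B0-11 False False
# 123-4567-A89-B0-1a True True
# 123-4567-a89-B0-11 False True
```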
