Wednesday, September 23, 2009

Regular Expression Engines

A regular expression is a string containing a combination of normal characters and special metacharacters or metasequences. The normal characters match themselves. Metacharacters and metasequences are characters or sequences of characters that represent ideas such as quantity, locations, or types of characters.

Pattern matching consists of finding asection of text that is described (matched) by a regular expression. The underlying code that searches the text is the regular expression engine. You can predict the results of most matches by keeping two rules in mind:
1. The earliest (leftmost) match wins
Regular expressions are applied to the input starting at the first character and proceeding toward the last. As soon as the regular expression engine finds a match, it returns.
2. Standard quantifiers are greedy
Quantifiers specify how many times something can be repeated. The standard quantifiers attempt to match as many times as possible. They settle for less than the maximum only if this is necessary for the success of the match. The process of giving up characters and trying less-greedy matches is called backtracking.

Regular expression engines have differences based on their type. There are two classes of engines: Deterministic Finite Automaton (DFA) and Nondeterministic Finite Automaton (NFA). DFAs are faster, but lack many of the features of an NFA, such as capturing, lookaround, and nongreedy quantifiers. In the NFA world, there are two types: traditional and POSIX.

DFA engines
DFAs compare each character of the input string to the regular expression, keeping track of all matches in progress. Since each character is examined at most once, the DFA engine is the fastest. One additional rule to remember with DFAs is that the alternation metasequence is greedy. When more than one option in an
alternation (foo|foobar) matches, the longest one is selected. So, rule No. 1 can be amended to read "the longest leftmost match wins".

Traditional NFA engines
Traditional NFA engines compare each element of the regex to the input string, keeping track of positions where it chose between two options in the regex. If an
option fails, the engine backtracks to the most recently saved position. For standard quantifiers, the engine chooses the greedy option of matching more text; however, if that option leads to the failure of the match, the engine returns to a saved position and tries a less greedy path. The traditional NFA engine uses ordered alternation, where each option in the alternation is tried sequentially. A longer match may be ignored if an earlier option leads to a successful match. So, here rule #1 can be amended to read "the first leftmost match after greedy quantifiers have had their fill wins".

POSIX NFA engines
POSIX NFA Engines work similarly to Traditional NFAs with one exception: a POSIX engine always picks the longest of the leftmost matches. For example, the alternation cat|category would match the full word "category" whenever possible, even if the first alternative ("cat") matched and appeared earlier in the alternation.