Thursday, July 16, 2009

Regular Expressions

Regular expressions are the key to powerful, flexible, and efficient text processing. Regular expressions themselves, with a general pattern notation almost like a mini programming language, allow you to describe and parse text. With additional support provided by the particular tool being used, regular expressions can add, remove, isolate, and generally fold, spindle, and mutilate all kinds of text and data. That tool might be as simple as a text editor's search command or as powerful as a full text-processing language.
The Filename Analogy
You know that report.txt is a specific filename, and that the pattern "*.txt" can be used to select multiple files. In filename patterns like this (called file globs), a few characters have special meanings. The star "*" means "match anything", and the question mark "?" means "match any one character". With "*.txt", we start with a match-anything "*" and end with the literal ".txt", so we end up with a pattern that means "select the files whose names start with anything and end with .txt".
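The glob behaviour described above can be tried out with a short sketch. It assumes Python's standard fnmatch module as the matching tool (the text names no particular tool), and the file names in the list are made up for illustration.

```python
import fnmatch

# Hypothetical directory listing, invented for this example.
filenames = ["report.txt", "notes.txt", "report.doc", "a.txt", "readme"]

# "*" matches any run of characters, so "*.txt" selects every name ending in ".txt".
txt_files = [name for name in filenames if fnmatch.fnmatchcase(name, "*.txt")]
print(txt_files)  # ['report.txt', 'notes.txt', 'a.txt']

# "?" matches exactly one character, so "?.txt" selects only one-letter base names.
short_files = [name for name in filenames if fnmatch.fnmatchcase(name, "?.txt")]
print(short_files)  # ['a.txt']
```

fnmatchcase is used here rather than fnmatch so that the comparison is case-sensitive on every platform.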
The Language Analogy
Full regular expressions are composed of two types of characters. The special characters (like the * from the filename analogy) are called metacharacters, while the rest are called literal, or normal text, characters. What sets regular expressions apart from filename patterns is the scope of power their metacharacters provide. Filename patterns provide limited metacharacters for limited needs, but a regular expression "language" provides rich and expressive metacharacters for advanced uses. It might help to consider regular expressions as their own language, with literal text acting as the words and metacharacters as the grammar. The words are combined with grammar according to a set of rules to create an expression that communicates an idea. For example, the expression used to find lines beginning with "From:" or "Subject:" is written as ^(From|Subject):.
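To see the ^(From|Subject): expression at work, here is a minimal sketch using Python's re module, which is an assumption since the text does not tie the expression to a tool. The header lines are invented sample data.

```python
import re

# "^" anchors the match to the start of the line; "(From|Subject)" accepts
# either literal word; ":" is an ordinary literal character.
header_pattern = re.compile(r"^(From|Subject):")

# Made-up mail header lines, for illustration only.
lines = [
    "From: alice@example.com",
    "To: bob@example.com",
    "Subject: Meeting notes",
    "Re: Subject: Meeting notes",
]

matching = [line for line in lines if header_pattern.search(line)]
print(matching)  # ['From: alice@example.com', 'Subject: Meeting notes']
```

The last sample line contains "Subject:" but does not begin with it, so the ^ anchor rejects it.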
A very simple use of a regular expression in this syntax is locating the same word spelled two different ways in a text editor: the regular expression seriali[sz]e matches both "serialise" and "serialize". Wildcards could also achieve this, but they are more limited in what they can match, having fewer metacharacters and a simpler underlying language.
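The character class in seriali[sz]e can be checked with a few lines of Python. This is only a sketch: the re module stands in for the text editor's search, and the sample sentence is made up.

```python
import re

# "[sz]" is a character class: it matches exactly one character, either "s" or "z".
spelling = re.compile(r"seriali[sz]e")

# Invented sample sentence containing both spellings.
text = "Some teams serialise to JSON; others serialize to XML."

found = spelling.findall(text)
print(found)  # ['serialise', 'serialize']
```

Any other letter in that position, as in "serialice", would not match.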
Wildcard characters are usually used to glob similar names in a list of files, whereas regexps are usually employed in applications that pattern-match text strings in general. For example, the regexp ^[ \t]+|[ \t]+$ matches excess whitespace at the beginning or end of a line. A more advanced regexp, which matches a decimal number with an optional sign, fractional part, and exponent, is ^[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?$.
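Both expressions can be exercised with a short sketch. It assumes Python's re module, whose syntax accepts these patterns unchanged; the input strings are invented test values.

```python
import re

# Leading whitespace (anchored by "^") or trailing whitespace (anchored by "$").
edge_whitespace = re.compile(r"^[ \t]+|[ \t]+$")

# Replacing every match with the empty string trims the line; the space
# between the two words is untouched because it touches neither anchor.
trimmed = edge_whitespace.sub("", "  \t hello world \t ")
print(repr(trimmed))  # 'hello world'

# Optional sign, then digits with an optional fraction (or a bare fraction),
# then an optional exponent.
number = re.compile(r"^[+-]?(\d+(\.\d+)?|\.\d+)([eE][+-]?\d+)?$")

# Made-up candidates: the first four are well formed, the last three are not.
candidates = ["42", "-3.14", ".5", "6.02e23", "1.", "abc", "1e"]
accepted = [s for s in candidates if number.match(s)]
print(accepted)  # ['42', '-3.14', '.5', '6.02e23']
```

Note that "1." is rejected because the pattern requires at least one digit after a decimal point, and "1e" is rejected because an exponent needs digits.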