Regular Expressions

Basic concepts

A regular expression, often called a pattern, specifies a set of strings. Specifying a set's members by a rule (such as a pattern) is usually more concise than listing them. For example, the set containing the three strings "Handel", "Händel", and "Haendel" can be specified by the pattern H(ä|ae?)ndel; the pattern is said to match each of the three strings. In most formalisms, if at least one regular expression matches a particular set, then infinitely many other regular expressions match it as well. Most formalisms provide the following operations to construct regular expressions.

Boolean "or"

A vertical bar separates alternatives. For example, gray|grey can match "gray" or "grey".

Grouping

Parentheses define the scope and precedence of the operators (among other uses). For example, gray|grey and gr(a|e)y are equivalent patterns that both describe the set containing "gray" and "grey".
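
The effect of grouping on the scope of the vertical bar can be checked with a short sketch. It uses Python's re module purely as an illustration (the text above is not tied to any one engine), and the candidate list and the helper name matched are assumptions made for the example.

```python
import re

# Hypothetical candidate strings chosen for this illustration.
candidates = ["gray", "grey", "gra", "ey", "gry"]

def matched(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

with_alternation = matched(r"gray|grey")
with_group = matched(r"gr(a|e)y")
# Without parentheses the bar splits the whole pattern into "gra" or "ey".
without_group = matched(r"gra|ey")

print(with_alternation)  # ['gray', 'grey']
print(with_group)        # ['gray', 'grey']
print(without_group)     # ['gra', 'ey']
```

The last line shows why the parentheses matter: the vertical bar has the lowest precedence, so without a group it separates everything to its left from everything to its right.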

Quantification

A quantifier after a token (such as a character) or group specifies how often that preceding element is allowed to occur. The most common quantifiers are the question mark ? (zero or one occurrence), the asterisk * (zero or more occurrences, derived from the Kleene star), and the plus sign + (one or more occurrences, the Kleene cross).
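
As a minimal sketch of the three common quantifiers, assuming their usual meanings (? for zero or one occurrence, * for zero or more, + for one or more) and using Python's re module as the example engine; the sample strings and the helper matched are made up for the illustration.

```python
import re

# Hypothetical sample strings: zero, one, and two occurrences of "b".
samples = ["ac", "abc", "abbc"]

def matched(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

optional = matched(r"ab?c")      # "b" may occur zero or one time
zero_or_more = matched(r"ab*c")  # "b" may occur any number of times
one_or_more = matched(r"ab+c")   # "b" must occur at least once

print(optional)      # ['ac', 'abc']
print(zero_or_more)  # ['ac', 'abc', 'abbc']
print(one_or_more)   # ['abc', 'abbc']
```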

These constructions can be combined to form arbitrarily complex expressions, much like one can construct arithmetical expressions from numbers and the operations +, -, ×, and ÷. For example, H(ae?|ä)ndel and H(a|ae|ä)ndel are both valid patterns which match the same strings as the earlier example, H(ä|ae?)ndel.
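
The claim that the three patterns describe the same set can be spot-checked against a handful of strings. This is only a sketch using Python's re module; the non-matching candidates ("Hndel", "Haeendel", "Hendel") are assumptions added to show what the patterns reject, and a finite test list does not prove equivalence in general.

```python
import re

patterns = ["H(ä|ae?)ndel", "H(ae?|ä)ndel", "H(a|ae|ä)ndel"]

# The three intended spellings plus a few hypothetical near misses.
candidates = ["Handel", "Händel", "Haendel", "Hndel", "Haeendel", "Hendel"]

# For each pattern, collect the candidates it matches in full.
matches = {
    p: {s for s in candidates if re.fullmatch(p, s)}
    for p in patterns
}

expected = {"Handel", "Händel", "Haendel"}
all_agree = all(found == expected for found in matches.values())
print(all_agree)  # True
```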

Syntax

A number of special characters, or metacharacters, are used to denote actions or delimit groups. These special characters can be forced to be interpreted as normal characters by preceding them with a defined escape character, usually the backslash "\". For example, a dot is normally used as a "wild card" metacharacter to denote any character, but if preceded by a backslash it represents the dot character itself. The pattern c.t matches "cat", "cot", "cut", and non-words such as "czt" and "c.t"; but c\.t matches only "c.t". The backslash also escapes itself, i.e., two backslashes are interpreted as a literal backslash character.
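
The difference between the wildcard dot and the escaped dot can be sketched as follows, using Python's re module as an example engine. The word list is an assumption for the illustration; re.escape is a standard-library helper that inserts the backslashes automatically.

```python
import re

# Hypothetical test words, including one ("ct") with nothing between c and t.
words = ["cat", "cot", "cut", "czt", "c.t", "ct"]

# Unescaped dot: any single character between "c" and "t".
wildcard = [w for w in words if re.fullmatch(r"c.t", w)]
# Escaped dot: only a literal "." between "c" and "t".
literal = [w for w in words if re.fullmatch(r"c\.t", w)]

# Two backslashes in the pattern stand for one literal backslash.
backslash_ok = re.fullmatch(r"a\\b", "a\\b") is not None

# re.escape builds the escaped form of a literal string.
escaped = re.escape("c.t")

print(wildcard)      # ['cat', 'cot', 'cut', 'czt', 'c.t']
print(literal)       # ['c.t']
print(backslash_ok)  # True
print(escaped)       # c\.t
```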

POSIX Basic Regular Expressions

Traditional Unix regular expression syntax followed common conventions but often differed from tool to tool. The IEEE POSIX Basic Regular Expressions (BRE) standard, released alongside an alternative flavor called Extended Regular Expressions (ERE), was designed mostly for backward compatibility with the traditional (Simple Regular Expression) syntax. It provided a common standard that has since been adopted as the default syntax of many Unix regular expression tools, though there is often some variation or additional features. Many such tools also support ERE syntax through command line arguments.

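In the BRE syntax, most characters are treated as literals: they match only themselves (e.g., a matches "a"). The exceptions, listed below, are called metacharacters or metasequences.

.        Matches any single character.
[ ]      A bracket expression. Matches a single character listed within the brackets (e.g., [abc] matches "a", "b", or "c").
[^ ]     Matches a single character not listed within the brackets.
^        Matches the starting position within the string or line.
$        Matches the ending position of the string or line.
*        Matches the preceding element zero or more times.
\( \)    Defines a marked subexpression (a group).
\n       Matches what the nth marked subexpression matched, where n is a digit from 1 to 9.
\{m,n\}  Matches the preceding element at least m and not more than n times.

Examples:

.at matches any three-character string ending with "at", including "hat", "cat", and "bat".
[hc]at matches "hat" and "cat".
[^b]at matches every string matched by .at except "bat".
^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.
[hc]at$ matches "hat" and "cat", but only at the end of the string or line.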
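
A few BRE metacharacters (the dot, bracket expressions, and the ^ and $ anchors) are spelled the same way in many other flavors, so their behavior can be sketched in Python. Note the assumptions: Python's re module is not a POSIX BRE engine, the word list and the helper found are invented for the example, and BRE groups written \( \) correspond to plain ( ) in Python.

```python
import re

# Hypothetical word list for the illustration.
words = ["hat", "cat", "bat", "that", "hats"]

def found(pattern):
    """Return the words in which the pattern matches somewhere."""
    return [w for w in words if re.search(pattern, w)]

any_at = found(r".at")        # any character followed by "at"
h_or_c = found(r"[hc]at")     # "h" or "c" followed by "at"
not_b = found(r"[^b]at")      # any character except "b", then "at"
at_start = found(r"^[hc]at")  # match must begin at the start
at_end = found(r"[hc]at$")    # match must finish at the end

# The BRE backreference \(ab\)\1 would be written (ab)\1 in Python.
repeated = re.fullmatch(r"(ab)\1", "abab") is not None

print(any_at)    # ['hat', 'cat', 'bat', 'that', 'hats']
print(h_or_c)    # ['hat', 'cat', 'that', 'hats']
print(not_b)     # ['hat', 'cat', 'that', 'hats']
print(at_start)  # ['hat', 'cat', 'hats']
print(at_end)    # ['hat', 'cat', 'that']
print(repeated)  # True
```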

POSIX Extended Regular Expressions

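In the POSIX Extended Regular Expression (ERE) syntax, the meaning of a backslash before some metacharacters is reversed: the unescaped character is the metacharacter, and a backslash causes it to be treated as a literal character. So, for example, the BRE \( \) is written ( ) and \{ \} is written { }. Additionally, support is removed for \n backreferences and the following metacharacters are added:

?    Matches the preceding element zero or one time.
+    Matches the preceding element one or more times.
|    The choice (alternation) operator. Matches either the expression before or the expression after it.

Examples:

[hc]?at matches "at", "hat", and "cat".
[hc]*at matches "at", "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on.
[hc]+at matches "hat", "cat", "hhat", "chat", "hcat", "cchchat", and so on, but not "at".
cat|dog matches "cat" or "dog".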
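
The ERE conventions described above (unescaped ( ) and { } as operators, a backslash making them literal, plus the ?, + and | operators) happen to be written the same way in Python's re module, so they can be sketched there. This is an illustration under that assumption, not a POSIX ERE implementation; the word list and the helper matched are hypothetical.

```python
import re

# Hypothetical word list for the illustration.
words = ["at", "hat", "cat", "hhat", "dog"]

def matched(pattern):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

optional = matched(r"[hc]?at")  # zero or one leading "h" or "c"
repeated = matched(r"[hc]+at")  # one or more leading "h" or "c"
either = matched(r"cat|dog")    # alternation

# Unescaped parentheses and braces act as operators ...
grouped = re.fullmatch(r"(ab){2}", "abab") is not None
# ... and a backslash turns them back into literal characters.
literal = re.fullmatch(r"\(ab\)\{2\}", "(ab){2}") is not None

print(optional)  # ['at', 'hat', 'cat']
print(repeated)  # ['hat', 'cat', 'hhat']
print(either)    # ['cat', 'dog']
print(grouped)   # True
print(literal)   # True
```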

POSIX Extended Regular Expressions can often be used with modern Unix utilities by including the command line flag -E.

Since many ranges of characters depend on the chosen locale setting (for example, in some locales letters are ordered abc...zABC...Z, while in others they are ordered aAbBcC...zZ), the POSIX standard defines some classes or categories of characters as shown in the following table:

POSIX class  ASCII equivalent                      Description
[:alnum:]    [A-Za-z0-9]                           Alphanumeric characters
(none)       [A-Za-z0-9_]                          Alphanumeric characters plus "_"
(none)       [^A-Za-z0-9_]                         Non-word characters
[:alpha:]    [A-Za-z]                              Alphabetic characters
[:blank:]    [ \t]                                 Space and tab
(none)       (?<=\W)(?=\w)|(?<=\w)(?=\W)           Word boundaries
[:cntrl:]    [\x00-\x1F\x7F]                       Control characters
[:digit:]    [0-9]                                 Digits
(none)       [^0-9]                                Non-digits
[:graph:]    [\x21-\x7E]                           Visible characters
[:lower:]    [a-z]                                 Lowercase letters
[:print:]    [\x20-\x7E]                           Visible characters and the space character
[:punct:]    [\]\[!"#$%&'()*+,./:;<=>?@\^_`{|}~-]  Punctuation characters
[:space:]    [ \t\r\n\v\f]                         Whitespace characters
(none)       [^ \t\r\n\v\f]                        Non-whitespace characters
[:upper:]    [A-Z]                                 Uppercase letters
[:xdigit:]   [A-Fa-f0-9]                           Hexadecimal digits

Rows marked "(none)" have no standard POSIX class name.

POSIX character classes can only be used within bracket expressions. For example, [[:upper:]ab] matches the uppercase letters and lowercase "a" and "b".
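
Python's re module does not understand POSIX class names, so the sketch below rewrites them into ASCII ranges before compiling. The helper expand_posix_classes and the table ASCII_CLASSES are hypothetical names introduced here; the translation covers only a few classes, ignores locale, and naively replaces every [:name:] it sees, even outside a bracket expression.

```python
import re

# ASCII equivalents of a few POSIX classes, written so that they can be
# spliced inside a bracket expression. Locale-dependent behavior is NOT
# reproduced; this is an assumption made for the sketch.
ASCII_CLASSES = {
    "alnum": "A-Za-z0-9",
    "alpha": "A-Za-z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "A-Fa-f0-9",
}

def expand_posix_classes(pattern):
    """Replace each [:name:] with its ASCII range (hypothetical helper).

    Raises KeyError for class names missing from ASCII_CLASSES.
    """
    return re.sub(
        r"\[:([a-z]+):\]",
        lambda m: ASCII_CLASSES[m.group(1)],
        pattern,
    )

translated = expand_posix_classes("[[:upper:]ab]")
accepted = [c for c in "AZabcz9" if re.fullmatch(translated, c)]

print(translated)  # [A-Zab]
print(accepted)    # ['A', 'Z', 'a', 'b']
```

The class name is only meaningful inside the outer brackets, which is why the input is written [[:upper:]ab] rather than [:upper:]ab.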

An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers. The editor Vim further distinguishes word and word-head classes (using the notation \w and \h) since in many programming languages the characters that can begin an identifier are not the same as those that can occur in other positions.
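
The distinction between word-head and word characters can be sketched as an identifier pattern: one word-head character followed by any number of word characters. The sketch assumes ASCII-only identifiers (letters and underscore at the start; letters, digits, and underscore afterwards); real languages differ in detail, and the names IDENTIFIER and names are invented for the example.

```python
import re

# Word-head class [A-Za-z_] followed by zero or more word characters
# [A-Za-z0-9_]. ASCII-only, which is an assumption of this sketch.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Hypothetical candidate names.
names = ["total", "_tmp1", "1st", "x-y", ""]

valid = [n for n in names if IDENTIFIER.fullmatch(n)]

print(valid)  # ['total', '_tmp1']
```

"1st" is rejected because a digit is a word character but not a word-head character, which is the case the two separate classes exist to express.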

Note that what the POSIX regular expression standards call character classes are commonly referred to as POSIX character classes in other regular expression flavors which support them. With most other regular expression flavors, the term character class is used to describe what POSIX calls bracket expressions.