Regular expression

A regular expression (regex) is a sequence of characters that defines a pattern. Such a pattern is typically used to check whether a string matches it, to parse text, or to perform search-and-replace operations.

Perl has had a huge influence on most modern regex engines, so modern regular expressions are often called Perl-style. There is also the PCRE library, which implements regular expression pattern matching using the same syntax and semantics as Perl 5.

A given regex implementation may not include every feature. For example, Java does not support recursive patterns, and JavaScript has supported named capture groups only since ES2018.

An example of a Perl-style pattern with named capturing groups:

\b(?<city>[A-Za-z\s]+),\s(?<state>[A-Z]{2,2}):\s(?<areaCode>[0-9]{3,3})\b
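As a rough illustration of how such a pattern might be used from Java, the sketch below compiles the example pattern and reads its named groups. The class name NamedGroupsDemo, the parse helper, and the sample inputs are assumptions made for this example. The city group is written as [A-Za-z\s], assuming it is meant to allow letters and whitespace.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupsDemo {
    // The example pattern as a Java string literal (every backslash is doubled).
    static final Pattern LOCATION = Pattern.compile(
            "\\b(?<city>[A-Za-z\\s]+),\\s(?<state>[A-Z]{2,2}):\\s(?<areaCode>[0-9]{3,3})\\b");

    // Returns "city|state|areaCode" for the first match, or null if nothing matches.
    static String parse(String input) {
        Matcher m = LOCATION.matcher(input);
        if (!m.find()) {
            return null;
        }
        return m.group("city") + "|" + m.group("state") + "|" + m.group("areaCode");
    }

    public static void main(String[] args) {
        System.out.println(parse("Boston, MA: 617"));   // Boston|MA|617
        System.out.println(parse("no location here"));  // null
    }
}
```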

characters

character description
x The character x.
\cx The control character corresponding to x.
\\ The backslash character.
\t The tab character ('\u0009').
\n The newline (line feed) character ('\u000A').
\r The carriage-return character ('\u000D').
\f The form-feed character ('\u000C').
\a The alert character ('\u0007').
\e The escape character ('\u001B').
\0n The character with octal value 0n, where n is in the range [0; 7].
\0nn The character with octal value 0nn, where n is in the range [0; 7].
\0mnn The character with octal value 0mnn, where m is in the range [0; 3] and n is in the range [0; 7].
\xhh The character with hexadecimal value 0xhh.
\uhhhh The character with hexadecimal value 0xhhhh.
\x{h...h} The character with hexadecimal value 0xh...h (in Java, Character.MIN_CODE_POINT <= 0xh...h <= Character.MAX_CODE_POINT).
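The escapes in the table above can be tried out with a short Java sketch. The class name CharEscapesDemo and the matches helper are assumptions made for this example. Note that each backslash is doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class CharEscapesDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Hexadecimal 41 is 'A', Unicode 0042 is 'B', octal 103 is 'C'.
        System.out.println(matches("\\x41\\u0042\\0103", "ABC"));   // true
        // The regex escapes for tab and line feed match the real characters.
        System.out.println(matches("a\\tb\\n", "a\tb\n"));          // true
        // The braces form can address code points above U+FFFF, here U+1F600.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(matches("\\x{1F600}", emoji));           // true
        // Control-A is the character with code 1.
        System.out.println(matches("\\cA", String.valueOf((char) 1))); // true
    }
}
```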

character classes

A character class defines a set of characters as one unit. A character class is written in square brackets. In Java, classes can be nested.

class example description
[abc] A simple class listing its members. For example, a, b, or c.
[^abc] Negation. Any character except a, b, or c.
[a-z] Range. For example, a through z inclusive.
[a-d[m-p]] Union. For example, a through d, or m through p. This is the same as [a-dm-p].
[a-z&&[def]] Intersection. For example, d, e, or f. This is the same as [def].
[a-z&&[^bc]] Subtraction. For example, a through z, except for b and c. This is the same as [ad-z].
[a-z&&[^m-p]] Subtraction. For example, a through z, but not m through p. This is the same as [a-lq-z].
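The union, intersection, and subtraction forms can be checked one character at a time. This is a minimal sketch; the class name CharClassDemo and the in helper are assumptions made for this example.

```java
class CharClassDemo {
    // True if the single character c belongs to the given character class.
    static boolean in(String charClass, char c) {
        return String.valueOf(c).matches(charClass);
    }

    public static void main(String[] args) {
        System.out.println(in("[^abc]", 'x'));        // true: negation
        System.out.println(in("[a-d[m-p]]", 'n'));    // true: union
        System.out.println(in("[a-d[m-p]]", 'f'));    // false: outside both ranges
        System.out.println(in("[a-z&&[def]]", 'e'));  // true: intersection
        System.out.println(in("[a-z&&[^bc]]", 'b'));  // false: subtracted
        System.out.println(in("[a-z&&[^m-p]]", 'q')); // true: same as [a-lq-z]
    }
}
```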

predefined character classes

  • . - any character (line terminators are matched only in dot-all mode)
  • \d - a digit, same as [0-9]
  • \D - a non-digit, same as [^0-9]
  • \s - a whitespace character, same as [ \t\n\x0B\f\r]
  • \S - a non-whitespace character, same as [^\s]
  • \w - a word character, same as [a-zA-Z_0-9]
  • \W - a non-word character, same as [^\w]
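A few typical uses of the predefined classes, sketched in Java. The class name PredefinedClassesDemo, its helper methods, and the sample strings are assumptions made for this example.

```java
import java.util.Arrays;

class PredefinedClassesDemo {
    // Removes every non-digit character (\D).
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    // Splits on runs of whitespace (\s+) after trimming the ends.
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    // True if the string consists only of word characters (\w).
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("+1 (555) 010-9999"));                   // 15550109999
        System.out.println(Arrays.toString(words("  alpha\tbeta \n gamma "))); // [alpha, beta, gamma]
        System.out.println(isWord("user_42"));                                 // true
        System.out.println(isWord("user-42"));                                 // false
    }
}
```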

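predefined unicode and posix character classes

The \p{...} syntax matches characters by Unicode general category, which lets a pattern work beyond ASCII. Java also has POSIX-style classes such as \p{Digit} and \p{Alpha}. These match US-ASCII characters only, unless the U flag is enabled, in which case they behave like their Unicode counterparts.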

  • \p{Nd} - digit (Java POSIX counterpart is \p{Digit})
  • \p{N} - any kind of numeric character in any script
  • \p{L} - letter (the Java POSIX counterpart \p{Alpha} roughly corresponds to [\p{L}\p{Nl}])
  • \p{Lu} - uppercase letter (Java POSIX counterpart is \p{Upper})
  • \p{Ll} - lowercase letter (Java POSIX counterpart is \p{Lower})
  • \p{Cc} - control character (Java POSIX counterpart is \p{Cntrl})
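The difference between the Unicode categories and the ASCII-only defaults can be seen in a small Java sketch. The class name UnicodeClassesDemo, the constants, and the matches helper are assumptions made for this example. The inline (?U) flag is used to switch the POSIX and predefined classes to their Unicode versions.

```java
class UnicodeClassesDemo {
    static final String E_ACUTE = "\u00E9";      // Latin small letter e with acute
    static final String ARABIC_THREE = "\u0663"; // Arabic-Indic digit three

    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{L}", E_ACUTE));           // true: a Unicode letter
        System.out.println(matches("\\p{Ll}", E_ACUTE));          // true: a lowercase letter
        System.out.println(matches("\\p{Alpha}", E_ACUTE));       // false: ASCII only by default
        System.out.println(matches("(?U)\\p{Alpha}", E_ACUTE));   // true: Unicode version
        System.out.println(matches("\\p{Nd}", ARABIC_THREE));     // true: a decimal digit
        System.out.println(matches("\\d", ARABIC_THREE));         // false: same as [0-9]
        System.out.println(matches("(?U)\\d", ARABIC_THREE));     // true: Unicode version
    }
}
```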

In Java 8 and prior, it does not matter whether you use the Is prefix with the \p syntax or not. So in Java 8, \p{Alnum} and \p{IsAlnum} are identical.

In Java 9+ there is a difference. Without the Is prefix, the behavior is exactly the same as in previous versions of Java. The syntax with the Is prefix now matches Unicode characters too. For \p{IsPunct} this also means that it no longer matches the ASCII characters that are in the Symbol Unicode category.

quantifiers

A quantifier specifies how many times the preceding element may repeat. There are three kinds of quantifiers:

  • greedy quantifiers - match as much as possible, giving characters back only if the rest of the pattern fails
  • reluctant quantifiers - match as little as possible, taking more characters only if the rest of the pattern fails
  • possessive quantifiers - match as much as possible and never give characters back, so the attempt fails as soon as the rest of the pattern cannot match. This is primarily useful for performance reasons.

What is the difference between possessive and greedy quantifiers? When the rest of the pattern fails to match, the regex engine backtracks: a greedy quantifier gives back characters one at a time so that the rest of the pattern can be retried. A possessive quantifier never gives back what it has matched, so the match attempt at that position fails immediately.

greedy reluctant possessive description
X? X?? X?+ X, once or not at all.
X* X*? X*+ X, zero or more times.
X+ X+? X++ X, one or more times.
X{n} X{n}? X{n}+ X, exactly n times.
X{n,} X{n,}? X{n,}+ X, at least n times.
X{n,m} X{n,m}? X{n,m}+ X, at least n but not more than m times.
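The three kinds of quantifiers can be compared on one input. This is a minimal sketch; the class name QuantifierDemo, the firstMatch helper, and the sample input are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: .+ takes everything, then backtracks to the last '>'.
        System.out.println(firstMatch("<.+>", input));     // <a><b>
        // Reluctant: .+? stops at the first '>' it can.
        System.out.println(firstMatch("<.+?>", input));    // <a>
        // Possessive: .++ takes everything and never gives the final '>' back.
        System.out.println(firstMatch("<.++>", input));    // null
        // Possessive works when the repeated token cannot match what follows.
        System.out.println(firstMatch("<[^>]++>", input)); // <a>
    }
}
```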

groups

A regular expression enclosed in round brackets is treated as a group. You can apply quantifiers to a group, capture what the group matched, and refer to it later.

matcher description
(X) X, as a capturing group. Groups are numbered from 1 by counting their opening brackets from left to right. Group zero always stands for the entire expression.
(?<name>X) X, as a named-capturing group.
(?:X) X, as a non-capturing group (the result will not be saved).
(?>X) X, as an independent, non-capturing group.
\n Whatever the nth capturing group matched. For example, (a|b)c\1 would match either "aca" or "bcb" and would not match, for example, "acb".
\k<name> Whatever the named-capturing group "name" matched.
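Numbered groups, backreferences, and named groups can be sketched in Java as follows. The class name GroupsDemo, its helper methods, and the sample strings are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupsDemo {
    // Numbered groups: returns {group 0, group 1, group 2} for a "yyyy-mm" string, or null.
    static String[] splitYearMonth(String input) {
        Matcher m = Pattern.compile("(\\d{4})-(\\d{2})").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1), m.group(2) };
    }

    // Backreference to group 1: collapses an immediately repeated word.
    static String collapseRepeatedWord(String text) {
        return text.replaceAll("\\b(\\w+)\\s+\\1\\b", "$1");
    }

    // Named group with a named backreference: finds text between matching quotes.
    static String firstQuoted(String text) {
        Matcher m = Pattern.compile("(?<quote>['\"]).*?\\k<quote>").matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", splitYearMonth("2024-05"))); // 2024-05, 2024, 05
        System.out.println(collapseRepeatedWord("this is is a test"));    // this is a test
        System.out.println(firstQuoted("say \"it's\" now"));              // "it's"
        System.out.println("aca".matches("(a|b)c\\1"));                   // true
        System.out.println("acb".matches("(a|b)c\\1"));                   // false
    }
}
```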

boundary matchers

matcher description
^ The beginning of a line. Not to be confused with ^ inside square brackets, which negates a character class.
$ The end of a line.
\b A word boundary.
\B A non-word boundary.
\A The beginning of the input.
\G The end of the previous match.
\Z The end of the input but for the final terminator, if any.
\z The end of the input.
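The boundary matchers can be exercised with a short Java sketch. The class name BoundaryDemo, the count and found helpers, and the sample strings are assumptions made for this example. Pattern.MULTILINE is used to make ^ and $ match at every line rather than only at the ends of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Counts all non-overlapping matches of the pattern in the input.
    static int count(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // True if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Word boundaries skip the "cat" inside "concat".
        System.out.println(count(Pattern.compile("\\bcat\\b"), "cat concat cat")); // 2
        String lines = "one\ntwo\nthree";
        System.out.println(count(Pattern.compile("^\\w+$"), lines));                    // 0
        System.out.println(count(Pattern.compile("^\\w+$", Pattern.MULTILINE), lines)); // 3
        // \Z ignores a final line terminator, \z does not.
        System.out.println(found("end\\Z", "the end\n")); // true
        System.out.println(found("end\\z", "the end\n")); // false
    }
}
```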

logical operators

operator description
XY X followed by Y.
X|Y Either X or Y.

quotation

matcher description
\Q Quotes all characters until \E.
\E Ends quoting started by \Q.
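Quoting matters whenever literal text contains regex metacharacters. The sketch below compares an unquoted pattern, an explicit \Q...\E span, and the standard Pattern.quote method. The class name QuotingDemo, the matchesLiterally helper, and the sample strings are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuotingDemo {
    // True if the input equals the literal text, with all metacharacters quoted.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unquoted, "1+" means "one or more 1s", so the literal text does not match.
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));         // false
        System.out.println(Pattern.matches("1+1=2", "11=2"));          // true
        // Inside \Q...\E the plus sign is an ordinary character.
        System.out.println(Pattern.matches("\\Q1+1=2\\E", "1+1=2"));   // true
        System.out.println(matchesLiterally("1+1=2", "1+1=2"));        // true
        // Splitting on a literal dot needs quoting as well.
        System.out.println(Arrays.toString("a.b".split(Pattern.quote(".")))); // [a, b]
    }
}
```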

lookaround

A positive lookaround succeeds only if its expression matches, while a negative lookaround succeeds only if its expression does not match.

Lookahead means that you want to search to the right of your current position in the input string. Lookbehind means that you want to search to the left.

Lookaround constructs do not change the current position in the input string.

matcher description
(?=X) X, via zero-width positive lookahead.
(?!X) X, via zero-width negative lookahead.
(?<=X) X, via zero-width positive lookbehind.
(?<!X) X, via zero-width negative lookbehind.
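The four lookaround forms can be compared on one input. This is a minimal sketch; the class name LookaroundDemo, the firstMatch helper, and the sample input are assumptions made for this example. The returned text never includes what the lookaround inspected, because lookarounds are zero-width.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "cost: $42 or 17 EUR";
        // Positive lookbehind: digits preceded by a dollar sign.
        System.out.println(firstMatch("(?<=\\$)\\d+", input));     // 42
        // Positive lookahead: digits followed by " EUR".
        System.out.println(firstMatch("\\d+(?= EUR)", input));     // 17
        // Negative lookbehind: a whole number not preceded by a dollar sign.
        System.out.println(firstMatch("(?<!\\$)\\b\\d+", input));  // 17
        // Negative lookahead: digits not followed by " EUR".
        System.out.println(firstMatch("\\d+(?! EUR)", input));     // 42
    }
}
```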

flags

Flags define some parameters of the regular expression engine.

matcher description
(?idmsuxU-idmsuxU) Turns flags on or off for the rest of the pattern. idmsuxU are the flags allowed in Java. Flags before the - character are turned on, flags after it are turned off. Flags you do not need can be omitted.
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags on/off.
flag description
i Case-insensitive matching. By default, all major regex engines match in case-sensitive mode. In Java this flag covers only US-ASCII characters unless the u flag is also set.
x Turns on free-spacing mode. In this mode whitespace between tokens is ignored, so a b c is the same as abc. Single-line comments starting with # are also allowed.
m Turns on multi-line mode. In this mode ^ and $ match at the start and end of each line. By default, they match only at the start and end of the entire input.
s Turns on dot-all mode. In this mode the dot (.) also matches line terminators (such as \n, \r, or \r\n). By default, line terminators are the only characters the dot does not match.
u Enables Unicode-aware case folding. In this mode case-insensitive matching follows the Unicode standard instead of being limited to US-ASCII. It has an effect only together with the i flag.
U Enables the Unicode version of predefined character classes and POSIX character classes.
g Turns on global mode. In this mode the search looks for all matches; otherwise only the first match is returned. This flag is not supported in Java, but the same effect can be achieved by invoking Matcher.find() repeatedly (each call resumes where the last match left off, unless the matcher is reset).
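The flags above can be tried with inline syntax in Java, and a Matcher.find() loop can stand in for the missing g flag. This is a minimal sketch; the class name FlagsDemo, the countMatches helper, and the sample strings are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    static final String UPPER_E_ACUTE = "\u00C9";
    static final String LOWER_E_ACUTE = "\u00E9";

    // Emulates global mode: counts every match by calling find() repeatedly.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("HeLLo".matches("(?i)hello"));                     // true
        // Without u, case-insensitive matching is limited to US-ASCII.
        System.out.println(UPPER_E_ACUTE.matches("(?i)" + LOWER_E_ACUTE));    // false
        System.out.println(UPPER_E_ACUTE.matches("(?iu)" + LOWER_E_ACUTE));   // true
        // Dot-all mode lets the dot match a line terminator.
        System.out.println("a\nb".matches("a.b"));                            // false
        System.out.println("a\nb".matches("(?s)a.b"));                        // true
        // Free-spacing mode ignores whitespace and the trailing comment.
        System.out.println("abc".matches("(?x) a b c  # matches abc"));       // true
        // Flags scoped to a non-capturing group apply only inside it.
        System.out.println("AB".matches("(?i:a)B"));                          // true
        System.out.println("Ab".matches("(?i:a)B"));                          // false
        // Multi-line mode plus a find() loop visits every line.
        System.out.println(countMatches("(?m)^\\w+$", "one\ntwo"));           // 2
    }
}
```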