Regular expression
A regular expression (regex) is a sequence of characters that defines a pattern. Such a pattern is typically used to check whether a string matches it, to parse text, or to perform search-and-replace operations.
Perl has had a huge influence on most modern regex engines, so modern regular expressions are often called Perl-style. There is also the PCRE library, which implements regular expression pattern matching using the same syntax and semantics as Perl 5.
A regex implementation may not include every feature. For example, Java does not support recursive patterns, and JavaScript has supported named capture groups only since ES2018.
\b(?<city>[A-Za-z\s]+),\s(?<state>[A-Z]{2,2}):\s(?<areaCode>[0-9]{3,3})\b
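A minimal sketch of how the address pattern above might be used from Java. The class name AddressExample, the method parse and the sample input are assumptions made for illustration; only java.util.regex.Pattern and Matcher come from the JDK. Note that inside a Java string literal every regex backslash has to be doubled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AddressExample {
    // The pattern as a Java string literal: every regex backslash is doubled.
    static final Pattern ADDRESS = Pattern.compile(
        "\\b(?<city>[A-Za-z\\s]+),\\s(?<state>[A-Z]{2,2}):\\s(?<areaCode>[0-9]{3,3})\\b");

    /** Returns "city|state|areaCode" for the first match, or null if there is none. */
    static String parse(String input) {
        Matcher m = ADDRESS.matcher(input);
        if (!m.find()) {
            return null;
        }
        // Named groups are read back by name instead of by index.
        return m.group("city") + "|" + m.group("state") + "|" + m.group("areaCode");
    }

    public static void main(String[] args) {
        System.out.println(parse("New York, NY: 212")); // New York|NY|212
    }
}
```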
characters
character | description |
---|---|
x | The character x. |
\cx | The control character corresponding to x. |
\\ | The backslash character. |
\t | The tab character ('\u0009'). |
\n | The newline (line feed) character ('\u000A'). |
\r | The carriage-return character ('\u000D'). |
\f | The form-feed character ('\u000C'). |
\a | The alert character ('\u0007'). |
\e | The escape character ('\u001B'). |
\0n | The character with octal value 0n, where n is in the range [0; 7]. |
\0nn | The character with octal value 0nn, where n is in the range [0; 7]. |
\0mnn | The character with octal value 0mnn, where m is in the range [0; 3] and n is in the range [0; 7]. |
\xhh | The character with hexadecimal value 0xhh. |
\uhhhh | The character with hexadecimal value 0xhhhh. |
\x{h...h} | The character with hexadecimal value 0xh...h (in Java, Character.MIN_CODE_POINT <= 0xh...h <= Character.MAX_CODE_POINT). |
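The escapes in the table can be checked with a short Java sketch. EscapeExample and its matches helper are hypothetical names; each call tests whether the whole input is matched by the escape.

```java
import java.util.regex.Pattern;

final class EscapeExample {
    /** True if the whole input is matched by the regex. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\x41", "A"));      // hexadecimal 41 is 'A'
        System.out.println(matches("\\u0041", "A"));    // four-digit hexadecimal form
        System.out.println(matches("\\0101", "A"));     // octal 101 is 65, i.e. 'A'
        System.out.println(matches("a\\tb", "a\tb"));   // tab escape
        System.out.println(matches("\\cA", "\u0001"));  // control character for A
        // A code point above U+FFFF is a surrogate pair in a Java string.
        System.out.println(matches("\\x{1F600}", "\uD83D\uDE00"));
    }
}
```

All six calls are expected to print true.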
character classes
A character class defines a set of characters as a single unit. It is specified in square brackets. Nesting is allowed.
class example | description |
---|---|
[abc] | A simple class defined as a sequence of characters. For example, a, b, or c. |
[^abc] | Negation. Any character except a, b, or c. |
[a-z] | Range. For example, a through z inclusive. |
[a-d[m-p]] | Union. For example, a through d, or m through p. This is the same as [a-dm-p]. |
[a-z&&[def]] | Intersection. For example, d, e, or f. This is the same as [def]. |
[a-z&&[^bc]] | Subtraction. a through z, except for b and c. This is the same as [ad-z]. |
[a-z&&[^m-p]] | Subtraction. a through z, but not m through p. This is the same as [a-lq-z]. |
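Union, intersection and subtraction can be tried out with a small Java sketch. CharClassExample and keep are hypothetical names: keep filters an input string down to the characters that belong to the given class.

```java
import java.util.regex.Pattern;

final class CharClassExample {
    /** Returns only those characters of input that are members of the character class. */
    static String keep(String charClass, String input) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            // Test each character on its own against the class.
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(keep("[^abc]", "abcd1"));             // d1
        System.out.println(keep("[a-d[m-p]]", "abcdefmnopqz"));  // abcdmnop
        System.out.println(keep("[a-z&&[def]]", "abcdefg"));     // def
        System.out.println(keep("[a-z&&[^bc]]", "abcd"));        // ad
        System.out.println(keep("[a-z&&[^m-p]]", "lmnopq"));     // lq
    }
}
```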
predefined character classes
- . - any character (line terminators are matched only in dot-all mode)
- \d - a digit, same as [0-9]
- \D - a non-digit, same as [^0-9]
- \s - a whitespace character, same as [ \t\n\x0B\f\r]
- \S - a non-whitespace character, same as [^\s]
- \w - a word character, same as [a-zA-Z_0-9]
- \W - a non-word character, same as [^\w]
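The predefined classes are handy in everyday string methods. The sketch below uses the standard String.replaceAll, String.split and String.matches methods; the class PredefinedClassExample and its three helper names are assumptions made for illustration.

```java
final class PredefinedClassExample {
    /** Removes every non-digit character. */
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    /** Splits on runs of whitespace after trimming the ends. */
    static String[] words(String s) {
        return s.trim().split("\\s+");
    }

    /** True if the string consists only of word characters. */
    static boolean isWord(String s) {
        return s.matches("\\w+");
    }

    public static void main(String[] args) {
        System.out.println(digitsOnly("+1 (512) 555-0100"));            // 15125550100
        System.out.println(String.join(",", words("  a\tb \n c ")));    // a,b,c
        System.out.println(isWord("snake_case_1"));                     // true
        System.out.println(isWord("kebab-case"));                       // false
    }
}
```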
predefined posix character classes
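The \p{...} classes can be used to support Unicode. The classes listed below are Unicode general categories. The Java aliases given in parentheses are POSIX classes, which match only US-ASCII characters unless the U flag is turned on.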
- \p{Nd} - digit (Java alias is \p{Digit})
- \p{N} - any kind of numeric character in any script
- \p{L} - letter (for [\p{L}\p{Nl}] the Java alias is \p{Alpha})
- \p{Lu} - uppercase letter (Java alias is \p{Upper})
- \p{Ll} - lowercase letter (Java alias is \p{Lower})
- \p{Cc} - control character (Java alias is \p{Cntrl})
In Java 8 and prior, it does not matter whether you use the Is prefix with the \p syntax or not. So in Java 8, \p{Alnum} and \p{IsAlnum} are identical.
In Java 9+ there is a difference. Without the Is prefix, the behavior is exactly the same as in previous versions of Java. The syntax with the Is prefix now matches Unicode characters too. For \p{IsPunct} this also means that it no longer matches the ASCII characters that are in the Symbol Unicode category.
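A short Java sketch of the difference between Unicode categories and the POSIX-style classes. UnicodeClassExample and matches are hypothetical names; the non-ASCII sample text is written with Unicode escapes so that the source file encoding does not matter. The behavior shown relies on the POSIX classes being ASCII-only by default and becoming Unicode-aware under the embedded (?U) flag.

```java
import java.util.regex.Pattern;

final class UnicodeClassExample {
    // A Polish word with three non-ASCII letters (U+017B, U+00F3, U+0142) followed by 'w'.
    static final String WORD = "\u017B\u00F3\u0142w";
    // ARABIC-INDIC DIGIT THREE (U+0663), a decimal digit outside ASCII.
    static final String DIGIT = "\u0663";

    /** True if the whole input is matched by the regex. */
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\p{L}+", WORD));          // true: any letter
        System.out.println(matches("\\p{Lu}\\p{Ll}+", WORD));  // true: one upper, then lower
        System.out.println(matches("\\p{Alpha}+", WORD));      // false: ASCII only by default
        System.out.println(matches("(?U)\\p{Alpha}+", WORD));  // true: Unicode version
        System.out.println(matches("\\p{Nd}", DIGIT));         // true
        System.out.println(matches("\\p{Digit}", DIGIT));      // false: ASCII only by default
    }
}
```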
quantifiers
A quantifier specifies how many times the preceding element may be repeated. There are three kinds of quantifiers:
- greedy quantifiers - match as much as possible, then give characters back one at a time if the rest of the pattern fails
- reluctant quantifiers - match as little as possible, then take more characters if the rest of the pattern fails
- possessive quantifiers - match as much as possible and never give anything back, so the match attempt fails as soon as the rest of the pattern fails. This is primarily useful for performance reasons.
What is the difference between possessive and greedy quantifiers? With a greedy quantifier, the regex engine backtracks and tries shorter repetitions when the rest of the pattern does not match. A possessive quantifier does not backtrack: once it has consumed its input, a failure in the rest of the pattern fails the match attempt at that position immediately.
greedy | reluctant | possessive | description |
---|---|---|---|
X? | X?? | X?+ | X, once or not at all. |
X* | X*? | X*+ | X, zero or more times. |
X+ | X+? | X++ | X, one or more times. |
X{n} | X{n}? | X{n}+ | X, exactly n times. |
X{n,} | X{n,}? | X{n,}+ | X, at least n times. |
X{n,m} | X{n,m}? | X{n,m}+ | X, at least n but not more than m times. |
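The three kinds of quantifiers can be compared on the same input. This is a minimal Java sketch; QuantifierExample, firstMatch and the sample markup string are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierExample {
    /** Returns the first match of regex in input, or null when there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>bold</b>";
        // Greedy: .+ runs to the end, then backs off to the last '>'.
        System.out.println(firstMatch("<.+>", input));   // <b>bold</b>
        // Reluctant: .+? grows one character at a time and stops at the first '>'.
        System.out.println(firstMatch("<.+?>", input));  // <b>
        // Possessive: .++ runs to the end and gives nothing back, so '>' can never match.
        System.out.println(firstMatch("<.++>", input));  // null
    }
}
```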
groups
A regular expression inside round brackets is treated as a group. You can apply quantifiers to a group, capture what the group matched, and refer to it later.
matcher | description |
---|---|
(X) | X, as a capturing group. Groups are numbered starting from 1. Group zero always stands for the entire expression. |
(?<name>X) | X, as a named-capturing group. |
(?:X) | X, as a non-capturing group (the result will not be saved). |
(?>X) | X, as an independent, non-capturing group (also called an atomic group). Once it has matched, the engine does not backtrack into it. |
\n | Whatever the nth capturing group matched. For example, (a|b)c\1 would match either "aca" or "bcb" and would not match, for example, "acb". |
\k<name> | Whatever the named-capturing group "name" matched. |
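The group constructs from the table, sketched in Java. GroupExample and its method names are hypothetical; the "${name}" replacement syntax and Matcher.group(String) are part of java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupExample {
    /** Numbered backreference: the third character must repeat the first one. */
    static boolean backref(String s) {
        return s.matches("(a|b)c\\1");
    }

    /** Swaps two words by referring to named groups in the replacement string. */
    static String swap(String s) {
        return s.replaceAll("(?<first>\\w+) (?<second>\\w+)", "${second} ${first}");
    }

    /** Returns the first word that is immediately repeated, or null. */
    static String doubledWord(String s) {
        Matcher m = Pattern.compile("\\b(?<word>\\w+)\\s+\\k<word>\\b").matcher(s);
        return m.find() ? m.group("word") : null;
    }

    public static void main(String[] args) {
        System.out.println(backref("aca"));                 // true
        System.out.println(backref("acb"));                 // false
        System.out.println(swap("hello world"));            // world hello
        System.out.println(doubledWord("it is is fine"));   // is
        // Atomic group: after taking "a" the group cannot be re-entered to try "ab".
        System.out.println(Pattern.matches("(?:a|ab)c", "abc")); // true
        System.out.println(Pattern.matches("(?>a|ab)c", "abc")); // false
    }
}
```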
boundary matchers
matcher | description |
---|---|
^ | The beginning of a line. Not to be confused with ^ inside square brackets, which negates a character class. |
$ | The end of a line. |
\b | A word boundary. |
\B | A non-word boundary. |
\A | The beginning of the input. |
\G | The end of the previous match. |
\Z | The end of the input but for the final terminator, if any. |
\z | The end of the input. |
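A minimal Java sketch that counts matches to show where each boundary matcher applies. BoundaryExample and count are hypothetical names; the (?m) prefix used in one call is the inline multi-line flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundaryExample {
    /** Counts how many times regex is found in input. */
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Only the standalone word is surrounded by word boundaries.
        System.out.println(count("\\bcat\\b", "concat cat catalog")); // 1
        // \B requires that there is no boundary before "cat".
        System.out.println(count("\\Bcat", "concat cat"));            // 1
        // Without multi-line mode, ^ and $ refer to the whole input.
        System.out.println(count("^\\w+$", "one\ntwo"));              // 0
        System.out.println(count("(?m)^\\w+$", "one\ntwo"));          // 2
        // \Z tolerates a final line terminator, \z does not.
        System.out.println(count("end\\Z", "end\n"));                 // 1
        System.out.println(count("end\\z", "end\n"));                 // 0
        // \G anchors each match to the end of the previous one.
        System.out.println(count("\\G\\d", "123a45"));                // 3
    }
}
```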
logical operators
operator | description |
---|---|
XY | X followed by Y. |
X|Y | Either X or Y. |
quotation
matcher | description |
---|---|
\Q | Quotes all characters until \E. |
\E | Ends quoting started by \Q. |
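Quoting is useful when user input has to be matched literally. The Java sketch below uses \Q...\E directly and the standard Pattern.quote helper; the class name QuoteExample and the sample strings are assumptions.

```java
import java.util.regex.Pattern;

final class QuoteExample {
    /** True if input contains the literal text, with no regex interpretation. */
    static boolean containsLiteral(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).find();
    }

    public static void main(String[] args) {
        // Unquoted, "1+" means "one or more 1", so the text does not match itself.
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));          // false
        System.out.println(Pattern.matches("\\Q1+1=2\\E", "1+1=2"));    // true
        // Pattern.quote wraps its argument in \Q and \E.
        System.out.println(Pattern.quote("a.b"));                       // \Qa.b\E
        System.out.println(containsLiteral("a.b", "xxa.bxx"));          // true
        System.out.println(containsLiteral("a.b", "xxaxbxx"));          // false
    }
}
```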
lookaround
The word positive means that the expression must match for the lookaround to succeed, while the word negative means that it must not match.
Lookahead means that you want to search to the right of your current position in the input string. Lookbehind means that you want to search to the left.
Lookaround constructs do not change the current position in the input string.
matcher | description |
---|---|
(?=X) | X, via zero-width positive lookahead. |
(?!X) | X, via zero-width negative lookahead. |
(?<=X) | X, via zero-width positive lookbehind. |
(?<!X) | X, via zero-width negative lookbehind. |
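The four lookaround forms, sketched in Java. LookaroundExample, firstMatch and the sample strings are hypothetical. Because lookarounds are zero-width, the text they inspect is not part of the returned match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundExample {
    /** Returns the first match of regex in input, or null when there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String prices = "100 USD, 200 EUR";
        // Positive lookahead: a number that is followed by " USD".
        System.out.println(firstMatch("\\d+(?= USD)", prices));          // 100
        // Negative lookahead: a whole number that is not followed by " USD".
        System.out.println(firstMatch("\\b\\d+\\b(?! USD)", prices));    // 200

        String order = "cost: $42, qty 7";
        // Positive lookbehind: a number preceded by a dollar sign.
        System.out.println(firstMatch("(?<=\\$)\\d+", order));           // 42
        // Negative lookbehind: a number that does not follow a dollar sign.
        System.out.println(firstMatch("(?<!\\$)\\b\\d+", order));        // 7

        // Several lookaheads can test the same position independently.
        String rule = "(?=.*\\d)(?=.*[a-z]).{8,}";
        System.out.println(Pattern.matches(rule, "passw0rd"));           // true
        System.out.println(Pattern.matches(rule, "password"));           // false
    }
}
```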
flags
Flags define some parameters of the regular expression engine.
matcher | description |
---|---|
(?idmsuxU-idmsuxU) | Nothing is matched, but flags are switched: flags before the - character are turned on, flags after it are turned off. idmsuxU is the set of flags allowed in Java. You can omit the flags you do not need. |
(?idmsux-idmsux:X) | X, as a non-capturing group with the given flags on/off. |
flag | description |
---|---|
i | Case insensitivity. By default, all major regex engines match in case-sensitive mode. In Java, case-insensitive matching covers only US-ASCII characters unless the u flag is also turned on. |
x | Turns on free-spacing mode. In this mode whitespace between tokens is ignored, so a b c is the same as abc. You can also use single-line comments that start with #. |
m | Turns on multi-line mode. In this mode ^ and $ match at the start and end of each line. By default, they only match at the start and end of the entire string. |
s | Turns on dot-all mode. In this mode the dot (.) also matches line terminators (such as \n, \r, or \r\n). By default, line terminators are the only characters the dot does not match. |
u | Enables Unicode-aware case folding. With this flag, case-insensitive matching also applies to non-ASCII characters. It only has an effect together with flag i. |
U | Enables the Unicode version of predefined character classes and POSIX character classes. |
g | Turns on global mode. In this mode the search looks for all matches; otherwise only the first match is returned. This flag is not supported in Java, but you can achieve the same result by repeatedly invoking the Matcher.find() method (it resumes where the last match left off, unless the matcher is reset). |
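A Java sketch of the inline flags and of a find loop that stands in for the g flag. FlagExample and findAll are hypothetical names; the non-ASCII letters are written as Unicode escapes (U+00E9 and U+00C9 are the lowercase and uppercase forms of the same accented letter).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FlagExample {
    /** Collects every match, which is what the g flag does in other engines. */
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {          // each call resumes after the previous match
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("(?i)hello", "HeLLo"));           // true
        // i alone folds only ASCII letters; adding u covers other letters too.
        System.out.println(Pattern.matches("(?i)\u00E9", "\u00C9"));         // false
        System.out.println(Pattern.matches("(?iu)\u00E9", "\u00C9"));        // true
        // x: whitespace and the trailing comment are ignored.
        System.out.println(Pattern.matches("(?x) a b c # letters", "abc"));  // true
        // s: the dot also matches the line terminator.
        System.out.println(Pattern.matches("a.b", "a\nb"));                  // false
        System.out.println(Pattern.matches("(?s)a.b", "a\nb"));              // true
        // Flags scoped to a group, and a flag turned off again mid-pattern.
        System.out.println(Pattern.matches("(?i:a)b", "AB"));                // false
        System.out.println(Pattern.matches("(?i)a(?-i)b", "Ab"));            // true
        System.out.println(findAll("\\d+", "a1b22c333"));                    // [1, 22, 333]
    }
}
```

The same flags can also be passed as constants to Pattern.compile, for example Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE.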