Patterns are compiled regular expressions. In many cases, convenience methods such as
String#matches, String#replaceAll and String#split will be preferable, but if you need to do a
lot of work with the same regular expression, it may be more efficient to compile it once and
reuse it. The Pattern class and its companion, Matcher, also offer more functionality than the
small amount exposed by String.
// String convenience methods:
boolean sawFailures = s.matches("Failures: \\d+");
String farewell = s.replaceAll("Hello, (\\S+)", "Goodbye, $1");
String[] fields = s.split(":");
// Direct use of Pattern:
Pattern p = Pattern.compile("Hello, (\\S+)");
Matcher m = p.matcher(inputString);
while (m.find()) { // Find each match in turn; String can't do this.
    String name = m.group(1); // Access a submatch group; String can't do this.
}
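As a rough illustration of the compile-once advice above, the sketch below keeps one Pattern in a static field and reuses it for every input line. The class name GreetingScanner and the method namesIn are hypothetical names chosen for this example; they are not part of the API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreetingScanner {
    // Compiled once; Pattern instances are immutable, so the field can be shared.
    private static final Pattern GREETING = Pattern.compile("Hello, (\\S+)");

    /** Returns the name captured from every greeting found in the given lines. */
    static List<String> namesIn(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            // A new Matcher per input reuses the already compiled pattern.
            Matcher m = GREETING.matcher(line);
            while (m.find()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(namesIn(Arrays.asList("Hello, Ada and Hello, Bob", "no greeting")));
    }
}
```

Calling String#replaceAll or String#matches in the same loop would hand the regular expression string to the library on every iteration, which is the repeated work the static field is meant to avoid.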
Regular expression syntax
Java supports a subset of Perl 5 regular expression syntax. An important gotcha is that Java
has no regular expression literals, and uses plain old string literals instead. This means that
you need an extra level of escaping. For example, the regular expression \s+ has to
be represented as the string "\\s+".
Escape sequences
\ | Quote the following metacharacter (so \. matches a literal .). |
\Q | Quote all following metacharacters until \E. |
\E | Stop quoting metacharacters (started by \Q). |
\\ | A literal backslash. |
\uhhhh | The Unicode character U+hhhh (in hex). |
\xhh | The Unicode character U+00hh (in hex). |
\cx | The ASCII control character ^x (so \cH would be ^H, U+0008). |
\a | The ASCII bell character (U+0007). |
\e | The ASCII ESC character (U+001b). |
\f | The ASCII form feed character (U+000c). |
\n | The ASCII newline character (U+000a). |
\r | The ASCII carriage return character (U+000d). |
\t | The ASCII tab character (U+0009). |
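The escapes in this table interact with the extra level of string-literal escaping described earlier, which is easy to get wrong. The following is a minimal sketch, not a definitive recipe; EscapeDemo and its method names are hypothetical, while Pattern.quote is the standard helper that wraps a string in the quoting escapes for you.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    /** Splits on runs of whitespace. The regex \s+ is written "\\s+" in Java source. */
    static String[] splitOnWhitespace(String input) {
        return input.split("\\s+");
    }

    /** True if the input contains a backslash; the regex needs two, the source four. */
    static boolean containsBackslash(String input) {
        return Pattern.compile("\\\\").matcher(input).find();
    }

    /** True if the whole input is the text $1.50, quoted with \Q and \E. */
    static boolean isLiteralPrice(String input) {
        return input.matches("\\Q$1.50\\E");
    }

    /** True if the whole input equals the literal, whatever metacharacters it holds. */
    static boolean equalsLiteral(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).matches();
    }

    /** True if the input is a single tab, checked with two equivalent escapes. */
    static boolean isTab(String input) {
        return input.matches("\\t") && input.matches("\\x09");
    }

    public static void main(String[] args) {
        System.out.println(splitOnWhitespace("a  b\tc").length);
        System.out.println(containsBackslash("C:\\temp"));
        System.out.println(isLiteralPrice("$1.50"));
        System.out.println(equalsLiteral("a.b*", "a.b*"));
        System.out.println(isTab("\t"));
    }
}
```

Pattern.quote is usually the safer choice when the literal text comes from user input, since it does not require you to spot each metacharacter yourself.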
Character classes
It's possible to construct arbitrary character classes using set operations:
[abc] | Any one of a, b, or c. (Enumeration.) |
[a-c] | Any one of a, b, or c. (Range.) |
[^abc] | Any character except a, b, or c. (Negation.) |
[[a-f][0-9]] | Any character in either range. (Union.) |
[[a-z]&&[jkl]] | Any character in both ranges. (Intersection.) |
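The set operations in the table above can be tried out with a few one-character matches. This is a small sketch under the assumption that the classes are used on their own; CharClassDemo and its field names are hypothetical, and the CONSONANT pattern is an extra example that combines intersection with negation.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // Union of two ranges: lowercase hex letters or decimal digits.
    static final Pattern HEX_DIGIT = Pattern.compile("[[a-f][0-9]]");
    // Intersection: characters that are in a-z and also in the set j, k, l.
    static final Pattern JKL = Pattern.compile("[[a-z]&&[jkl]]");
    // Intersection with a negated set: lowercase letters that are not vowels.
    static final Pattern CONSONANT = Pattern.compile("[[a-z]&&[^aeiou]]");

    /** True if the whole input is matched by the given single-character class. */
    static boolean is(Pattern characterClass, String input) {
        return characterClass.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(is(HEX_DIGIT, "c"));
        System.out.println(is(HEX_DIGIT, "g"));
        System.out.println(is(JKL, "k"));
        System.out.println(is(CONSONANT, "e"));
    }
}
```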
Most of the time, the built-in character classes are more useful:
\d | Any digit character (see note below). |
\D | Any non-digit character (see note below). |
\s | Any whitespace character (see note below). |
\S | Any non-whitespace character (see note below). |
\w | Any word character (see note below). |
\W | Any non-word character (see note below). |
\p{NAME} | Any character in the class with the given NAME. |
\P{NAME} | Any character not in the named class. |
Note that these built-in classes don't just cover the traditional ASCII range. For example,
\w is equivalent to the character class [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].
For more details see Unicode TR-18,
and bear in mind that the set of characters in each class can vary between Unicode releases.
If you actually want to match only ASCII characters, specify the explicit characters you want;
if you mean 0-9 use [0-9] rather than \d, which would also include
Gurmukhi digits and so forth.
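To make the ASCII versus Unicode distinction concrete, the sketch below compares an explicit [0-9] range with the Unicode decimal digit category \p{Nd} on a Gurmukhi digit. DigitDemo and its members are hypothetical names. The result for \d is printed rather than asserted, on the assumption that it can differ between regular expression implementations.

```java
final class DigitDemo {
    /** GURMUKHI DIGIT THREE, a decimal digit outside the ASCII range. */
    static final String GURMUKHI_THREE = "\u0A69";

    /** True only if every character is one of the ASCII digits 0 to 9. */
    static boolean isAsciiDigits(String s) {
        return s.matches("[0-9]+");
    }

    /** True if every character is in the Unicode decimal digit category Nd. */
    static boolean isUnicodeDigits(String s) {
        return s.matches("\\p{Nd}+");
    }

    public static void main(String[] args) {
        System.out.println("[0-9] on ASCII:     " + isAsciiDigits("2024"));
        System.out.println("[0-9] on Gurmukhi:  " + isAsciiDigits(GURMUKHI_THREE));
        System.out.println("\\p{Nd} on Gurmukhi: " + isUnicodeDigits(GURMUKHI_THREE));
        // Deliberately not relied on: this result may depend on the runtime in use.
        System.out.println("\\d on Gurmukhi:     " + GURMUKHI_THREE.matches("\\d"));
    }
}
```

If input validation must accept ASCII digits only, for example before calling a parser that understands nothing else, the explicit range is the predictable choice.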
There are also a variety of named classes:
- Unicode category names, prefixed by Is. For example \p{IsLu} for all uppercase letters.
- POSIX class names. These are 'Alnum', 'Alpha', 'ASCII', 'Blank', 'Cntrl', 'Digit',
'Graph', 'Lower', 'Print', 'Punct', 'Upper', 'XDigit'.
- Unicode block names, as accepted as input to java.lang.Character.UnicodeBlock#forName,
prefixed by In. For example \p{InHebrew} for all characters in the Hebrew block.
- Character method names. These are all non-deprecated methods from java.lang.Character
whose name starts with is, but with the is replaced by java. For example, \p{javaLowerCase}.
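The four kinds of named class listed above can be exercised with one small helper. This is only a sketch: NamedClassDemo and allIn are hypothetical names, and the sample inputs are assumptions chosen to fall clearly inside or outside each class.

```java
final class NamedClassDemo {
    /** True if every character of input belongs to the named class, e.g. "IsLu". */
    static boolean allIn(String className, String input) {
        return input.matches("\\p{" + className + "}+");
    }

    /** True if no character of input belongs to the named class (uses \P). */
    static boolean noneIn(String className, String input) {
        return input.matches("\\P{" + className + "}+");
    }

    public static void main(String[] args) {
        System.out.println(allIn("IsLu", "ABC"));          // Unicode category
        System.out.println(allIn("Alnum", "x1"));          // POSIX class
        System.out.println(allIn("InHebrew", "\u05D0"));   // Unicode block (letter alef)
        System.out.println(allIn("javaLowerCase", "abc")); // Character method name
        System.out.println(noneIn("IsLu", "abc"));
    }
}
```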
Quantifiers
Quantifiers match some number of instances of the preceding regular expression.
* | Zero or more. |
? | Zero or one. |
+ | One or more. |
{n} | Exactly n. |
{n,} | At least n. |
{n,m} | At least n but not more than m. |
Quantifiers are "greedy" by default, meaning that they will match the longest possible input
sequence. There are also non-greedy quantifiers that match the shortest possible input sequence.
They're the same as the greedy ones but with a trailing ?:
*? | Zero or more (non-greedy). |
?? | Zero or one (non-greedy). |
+? | One or more (non-greedy). |
{n}? | Exactly n (non-greedy). |
{n,}? | At least n (non-greedy). |
{n,m}? | At least n but not more than m (non-greedy). |
Quantifiers allow backtracking by default. There are also possessive quantifiers to prevent
backtracking. They're the same as the greedy ones but with a trailing +:
*+ | Zero or more (possessive). |
?+ | Zero or one (possessive). |
++ | One or more (possessive). |
{n}+ | Exactly n (possessive). |
{n,}+ | At least n (possessive). |
{n,m}+ | At least n but not more than m (possessive). |
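The difference between the three quantifier families is easiest to see on one input. The sketch below, with the hypothetical helper QuantifierDemo.firstMatch, runs a greedy, a non-greedy and a possessive version of the same expression against the sample text <a><b>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuantifierDemo {
    /** Returns the first substring of input matched by regex, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<a><b>";
        // Greedy: .+ takes as much as it can, then backs off to the last '>'.
        System.out.println(firstMatch("<.+>", input));
        // Non-greedy: .+? takes as little as it can, stopping at the first '>'.
        System.out.println(firstMatch("<.+?>", input));
        // Possessive: .++ takes everything and never gives the final '>' back.
        System.out.println(firstMatch("<.++>", input));
    }
}
```

The possessive form finds no match here because the quantifier has already consumed the closing > and is not allowed to backtrack. That is the trade-off: possessive quantifiers can avoid expensive backtracking, but only when nothing after them needs text they have swallowed.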
Zero-width assertions
^ | At beginning of line. |
$ | At end of line. |
\A | At beginning of input. |
\b | At word boundary. |
\B | At non-word boundary. |
\G | At end of previous match. |
\z | At end of input. |
\Z | At end of input, or before newline at end. |
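Since these assertions consume no input, their effect shows up in where and how often a pattern matches rather than in what it matches. The following sketch counts matches for a few of them; AnchorDemo and countMatches are hypothetical names, and the flags argument is simply passed through to Pattern.compile.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorDemo {
    /** Counts the non-overlapping matches of regex in input under the given flags. */
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // \b keeps "cat" from matching inside "concat".
        System.out.println(countMatches("\\bcat\\b", 0, "cat concat cat"));
        // Without MULTILINE, ^ only matches at the start of the input.
        System.out.println(countMatches("^\\w+", 0, "one\ntwo\nthree"));
        System.out.println(countMatches("^\\w+", Pattern.MULTILINE, "one\ntwo\nthree"));
        // \Z tolerates a trailing newline; \z does not.
        System.out.println(countMatches("abc\\Z", 0, "abc\n"));
        System.out.println(countMatches("abc\\z", 0, "abc\n"));
        // \G chains matches: each one must start where the previous one ended.
        System.out.println(countMatches("\\G\\d", 0, "123a45"));
    }
}
```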
Look-around assertions
Look-around assertions assert that the subpattern does (positive) or doesn't (negative) match
after (look-ahead) or before (look-behind) the current position, without including the matched
text in the containing match. The maximum length of possible matches for look-behind patterns
must be bounded.
(?=a) | Zero-width positive look-ahead. |
(?!a) | Zero-width negative look-ahead. |
(?<=a) | Zero-width positive look-behind. |
(?<!a) | Zero-width negative look-behind. |
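A short sketch of all four look-around forms follows. LookaroundDemo and firstMatch are hypothetical names, and the currency strings are made-up sample inputs. In each case the text tested by the assertion is not part of the returned match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundDemo {
    /** Returns the first substring of input matched by regex, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Positive look-behind: digits that directly follow a dollar sign.
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost: $42"));
        // Positive look-ahead: digits that are directly followed by " USD".
        System.out.println(firstMatch("\\d+(?= USD)", "10 EUR, 25 USD"));
        // Negative look-behind: a whole number that does not follow a dollar sign.
        System.out.println(firstMatch("(?<!\\$)\\b\\d+", "$50 and 7"));
        // Negative look-ahead: a word starting with "foo" but not with "foobar".
        System.out.println(firstMatch("foo(?!bar)\\w*", "foobar foobaz"));
    }
}
```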
Groups
(a) | A capturing group. |
(?:a) | A non-capturing group. |
(?>a) | An independent non-capturing group. (The first match of the subgroup is the only match tried.) |
\n | The text already matched by capturing group n. |
See Matcher#group for details of how capturing groups are numbered and accessed.
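The group forms in the table above behave quite differently despite the similar syntax. This sketch, using the hypothetical class GroupDemo, shows a back-reference finding a doubled word, a non-capturing group leaving the group numbering untouched, and an independent group refusing to backtrack.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupDemo {
    /** Returns the first word that appears twice in a row, or null if there is none. */
    static String firstRepeatedWord(String text) {
        // \1 must match the same text that capturing group 1 matched.
        Matcher m = Pattern.compile("\\b(\\w+) \\1\\b").matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the number of capturing groups in regex; (?:...) groups are not counted. */
    static int capturingGroupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    /** True if the whole input is matched by regex. */
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("it was was fine"));
        System.out.println(capturingGroupCount("(?:Mr|Ms)\\. (\\w+)"));
        // The ordinary group gives back one 'a' so the trailing "ab" can match.
        System.out.println(matches("(?:a+)ab", "aaab"));
        // The independent group keeps every 'a' it took, so the match fails.
        System.out.println(matches("(?>a+)ab", "aaab"));
    }
}
```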
Operators
ab | Expression a followed by expression b. |
a|b | Either expression a or expression b. |
Flags
(?dimsux-dimsux:a) | Evaluates the expression a with the given flags enabled/disabled. |
(?dimsux-dimsux) | Evaluates the rest of the pattern with the given flags enabled/disabled. |
The flags are:
i | #CASE_INSENSITIVE | case insensitive matching |
d | #UNIX_LINES | only accept '\n' as a line terminator |
m | #MULTILINE | allow ^ and $ to match beginning/end of any line |
s | #DOTALL | allow . to match '\n' ("s" for "single line") |
u | #UNICODE_CASE | enable Unicode case folding |
x | #COMMENTS | allow whitespace and comments |
Either set of flags may be empty. For example, (?i-m) would turn on case-insensitivity
and turn off multiline mode, (?i) would just turn on case-insensitivity,
and (?-m) would just turn off multiline mode.
Note that on Android, UNICODE_CASE is always on: case-insensitive matching will
always be Unicode-aware.
There are two other flags not settable via this mechanism: #CANON_EQ and #LITERAL.
Attempts to use #CANON_EQ on Android will throw an exception.
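Flags can be given either inside the expression or as the second argument to Pattern.compile. The sketch below shows both routes; FlagDemo and its matches helper are hypothetical names, and CANON_EQ is deliberately left out given the note above.

```java
import java.util.regex.Pattern;

final class FlagDemo {
    /** True if the whole input is matched by regex compiled with the given flags. */
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Inline flag for the rest of the pattern, and the equivalent compile-time flag.
        System.out.println(matches("(?i)hello", 0, "HELLO"));
        System.out.println(matches("hello", Pattern.CASE_INSENSITIVE, "HeLLo"));
        // Scoped flag: only the 'a' inside the group is case-insensitive.
        System.out.println(matches("(?i:a)b", 0, "Ab"));
        System.out.println(matches("(?i:a)b", 0, "AB"));
        // DOTALL lets . match the newline between 'a' and 'b'.
        System.out.println(matches("a.b", 0, "a\nb"));
        System.out.println(matches("(?s)a.b", 0, "a\nb"));
        // COMMENTS mode ignores the spaces and the trailing # comment.
        System.out.println(matches("(?x) \\d+ # digits", 0, "123"));
        // LITERAL has no inline form: the dot is matched as an ordinary character.
        System.out.println(matches("a.b", Pattern.LITERAL, "axb"));
    }
}
```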
Implementation notes
The regular expression implementation used in Android is provided by
ICU. The notation for the regular
expressions is mostly a superset of that used in other Java language
implementations. This means that existing applications will normally work as
expected, but in rare cases Android may accept a regular expression that is
not accepted by other implementations.
In some cases, Android will recognize that a regular expression is a simple
special case that can be handled more efficiently. This is true of both the convenience methods
in String and the methods in Pattern.