With both maximizing ("greedy") and minimizing ("ungreedy" or
"lazy") repetition, failure of what follows normally causes the
repeated item to be re-evaluated to see if a different number of
repeats allows the rest of the pattern to match. Sometimes it is
useful to prevent this, either to change the nature of the match,
or to cause it to fail earlier than it otherwise might, when the
author of the pattern knows there is no point in carrying on.
Consider, for example, the pattern \d+foo when applied to the
subject line
123456bar
After matching all 6 digits and then failing to match "foo", the
normal action of the matcher is to try again with only 5 digits
matching the \d+ item, and then with 4, and so on, before
ultimately failing. "Atomic grouping" (a term taken from Jeffrey
Friedl's book) provides the means for specifying that once a
subpattern has matched, it is not to be re-evaluated in this way.
If we use atomic grouping for the previous example, the matcher
gives up immediately on failing to match "foo" the first time.
The notation is a kind of special parenthesis, starting with (?>
as in this example:
(?>\d+)foo
This kind of parenthesis "locks up" the part of the pattern it
contains once it has matched, and a failure further into the
pattern is prevented from backtracking into it. Backtracking past
it to previous items, however, works as normal.
An alternative description is that a subpattern of this type
matches the string of characters that an identical standalone
pattern would match, if anchored at the current point in the
subject string.
Atomic grouping subpatterns are not capturing subpatterns. Simple
cases such as the above example can be thought of as a maximizing
repeat that must swallow everything it can. So, while both \d+
and \d+? are prepared to adjust the number of digits they match
in order to make the rest of the pattern match, (?>\d+) can only
match an entire sequence of digits.
Atomic groups in general can of course contain arbitrarily
complicated subpatterns, and can be nested. However, when the
subpattern for an atomic group is just a single repeated item, as
in the example above, a simpler notation, called a "possessive
quantifier", can be used. This consists of an additional +
character following a quantifier. Using this notation, the
previous example can be rewritten as
\d++foo
Note that a possessive quantifier can be used with an entire
group, for example:
(abc|xyz){2,3}+
Possessive quantifiers are always greedy; the setting of the
PCRE_UNGREEDY option is ignored. They are a convenient notation
for the simpler forms of atomic group. A possessive quantifier
means exactly the same as the equivalent atomic group, though
performance may differ; possessive quantifiers should be slightly
faster.
The possessive quantifier syntax is an extension to the Perl 5.8
syntax. Jeffrey Friedl originated the idea (and the name) in the
first edition of his book. Mike McCloskey liked it, so
implemented it when he built Sun's Java package, and PCRE copied
it from there. It ultimately found its way into Perl at release
5.10.
PCRE has an optimization that automatically "possessifies"
certain simple pattern constructs. For example, the sequence A+B
is treated as A++B because there is no point in backtracking into
a sequence of A's when B must follow.
When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of times, the use
of an atomic group is the only way to avoid some failing matches
taking a very long time indeed. The pattern
(\D+|<\d+>)*[!?]
matches an unlimited number of substrings that either consist of
non-digits, or digits enclosed in <>, followed by either ! or ?.
When it matches, it runs quickly. However, if it is applied to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
it takes a long time before reporting failure. This is because
the string can be divided between the internal \D+ repeat and the
external * repeat in a large number of ways, and all have to be
tried. (The example uses [!?] rather than a single character at
the end, because both PCRE and Perl have an optimization that
allows for fast failure when a single character is used. They
remember the last single character that is required for a match,
and fail early if it is not present in the string.) If the
pattern is changed so that it uses an atomic group, like this:
((?>\D+)|<\d+>)*[!?]
sequences of non-digits cannot be broken, and failure happens
quickly.