An assertion is a test on the characters following or preceding
the current matching point that does not actually consume any
characters. The simple assertions coded as \b, \B, \A, \G, \Z,
\z, ^ and $ are described above.
More complicated assertions are coded as subpatterns. There are
two kinds: those that look ahead of the current position in the
subject string, and those that look behind it. An assertion
subpattern is matched in the normal way, except that it does not
cause the current matching position to be changed.
Assertion subpatterns are not capturing subpatterns. If such an
assertion contains capturing subpatterns within it, these are
counted for the purposes of numbering the capturing subpatterns
in the whole pattern. However, substring capturing is carried out
only for positive assertions. (Perl sometimes, but not always,
does do capturing in negative assertions.)
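The numbering and capturing rules can be illustrated with a short sketch. It uses Python's re module as a stand-in for PCRE, which is an assumption made here for convenience: the two engines agree on these simple patterns, but not on every detail of assertion handling.

```python
import re

# Group 1 sits inside a positive lookahead; group 2 is an ordinary group.
# The group inside the assertion is still counted, so the later group is number 2.
m = re.match(r"(?=(\w+);)(\w)", "abc;")
inside_positive = m.group(1)   # text captured inside the positive assertion
after_assertion = m.group(2)   # the single character actually consumed

# A group inside a negative lookahead is counted for numbering,
# but it is not set when the assertion (and the match) succeeds.
n = re.match(r"(?!(x))(\w)", "abc")
inside_negative = n.group(1)
```

The overall match in the first case ends after one character, because the assertion itself consumes nothing.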
WARNING: If a positive assertion containing one or more capturing
subpatterns succeeds, but failure to match later in the pattern
causes backtracking over this assertion, the captures within the
assertion are reset only if no higher numbered captures are
already set. This is, unfortunately, a fundamental limitation of
the current implementation, and as PCRE1 is now in maintenance-
only status, it is unlikely ever to change.
For compatibility with Perl, assertion subpatterns may be
repeated; though it makes no sense to assert the same thing
several times, the side effect of capturing parentheses may
occasionally be useful. In practice, there are only three cases:
(1) If the quantifier is {0}, the assertion is never obeyed
during matching. However, it may contain internal capturing
parenthesized groups that are called from elsewhere via the
subroutine mechanism.
(2) If the quantifier is {0,n} where n is greater than zero, it is
treated as if it were {0,1}. At run time, the rest of the pattern
match is tried with and without the assertion, the order
depending on the greediness of the quantifier.
(3) If the minimum repetition is greater than zero, the
quantifier is ignored. The assertion is obeyed just once when
encountered during matching.
Lookahead assertions
Lookahead assertions start with (?= for positive assertions and
(?! for negative assertions. For example,
\w+(?=;)
matches a word followed by a semicolon, but does not include the
semicolon in the match, and
foo(?!bar)
matches any occurrence of "foo" that is not followed by "bar".
Note that the apparently similar pattern
(?!foo)bar
does not find an occurrence of "bar" that is preceded by
something other than "foo"; it finds any occurrence of "bar"
whatsoever, because the assertion (?!foo) is always true when the
next three characters are "bar". A lookbehind assertion is needed
to achieve the other effect.
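The three lookahead patterns above can be tried out with a minimal sketch. Python's re module is used as a stand-in for PCRE here (an assumption; the engines behave alike for these patterns), and the subject strings are invented for illustration.

```python
import re

# \w+(?=;) : words followed by a semicolon, semicolon excluded from the match
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# foo(?!bar) : start offsets of each "foo" that is not followed by "bar"
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

# (?!foo)bar : finds every "bar", including the one preceded by "foo"
bar_starts = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]
```

The last list contains the offset of the "bar" inside "foobar", which shows why a lookbehind is needed to exclude it.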
If you want to force a matching failure at some point in a
pattern, the most convenient way to do it is with (?!) because an
empty string always matches, so an assertion that requires there
not to be an empty string must always fail. The backtracking
control verb (*FAIL) or (*F) is a synonym for (?!).
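As a small sketch of forced failure, the following uses Python's re module as a stand-in for PCRE (an assumption; Python accepts (?!) but not the (*FAIL) verb). The patterns and subjects are made up for illustration.

```python
import re

# (?!) can never succeed, so a search with it alone finds nothing.
always_fails = re.search(r"(?!)", "any subject at all")

# Forcing failure at the end of the first branch makes the matcher
# fall through to the second branch.
m = re.match(r"a(?!)|ab", "ab")
matched = m.group(0)
```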
Lookbehind assertions
Lookbehind assertions start with (?<= for positive assertions and
(?<! for negative assertions. For example,
(?<!foo)bar
does find an occurrence of "bar" that is not preceded by "foo".
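A minimal sketch of the negative and positive lookbehind forms follows, again using Python's re module as a stand-in for PCRE (an assumption) and an invented subject string.

```python
import re

subject = "foobar xbar bar"

# (?<!foo)bar : start offsets of each "bar" not immediately preceded by "foo"
not_after_foo = [m.start() for m in re.finditer(r"(?<!foo)bar", subject)]

# (?<=foo)bar : start offsets of each "bar" that is immediately preceded by "foo"
after_foo = [m.start() for m in re.finditer(r"(?<=foo)bar", subject)]
```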
The contents of a lookbehind assertion are restricted such that
all the strings it matches must have a fixed length. However, if
there are several top-level alternatives, they do not all have to
have the same fixed length. Thus
(?<=bullock|donkey)
is permitted, but
(?<!dogs?|cats?)
causes an error at compile time. Branches that match different
length strings are permitted only at the top level of a
lookbehind assertion. This is an extension compared with Perl,
which requires all branches to match the same length of string.
An assertion such as
(?<=ab(c|de))
is not permitted, because its single top-level branch can match
two different lengths, but it is acceptable to PCRE if rewritten
to use two top-level branches:
(?<=abc|abde)
In some cases, the escape sequence \K (see above) can be used
instead of a lookbehind assertion to get round the fixed-length
restriction.
The implementation of lookbehind assertions is, for each
alternative, to temporarily move the current position back by the
fixed length and then try to match. If there are insufficient
characters before the current position, the assertion fails.
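The step-back procedure just described can be sketched in a few lines of Python. This is only an illustration of the idea, not PCRE's actual code: the function name lookbehind_succeeds is hypothetical, and each alternative is assumed to be a literal string so that its fixed length is simply its length.

```python
def lookbehind_succeeds(subject, pos, alternatives):
    """Return True if any fixed-length literal alternative ends exactly at pos."""
    for alt in alternatives:
        start = pos - len(alt)   # step back by this alternative's fixed length
        if start < 0:
            continue             # too few characters before pos: this branch fails
        if subject[start:pos] == alt:
            return True          # matched without moving the caller's position
    return False


# Equivalent in spirit to testing (?<=bullock|donkey) at the end of each subject.
at_end_of_donkey = lookbehind_succeeds("a donkey", 8, ["bullock", "donkey"])
too_short = lookbehind_succeeds("key", 3, ["bullock", "donkey"])
```

A negative lookbehind would simply negate the result.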
In a UTF mode, PCRE does not allow the \C escape (which matches a
single data unit even in a UTF mode) to appear in lookbehind
assertions, because it makes it impossible to calculate the
length of the lookbehind. The \X and \R escapes, which can match
different numbers of data units, are also not permitted.
"Subroutine" calls (see below) such as (?2) or (?&X) are
permitted in lookbehinds, as long as the subpattern matches a
fixed-length string. Recursion, however, is not supported.
Possessive quantifiers can be used in conjunction with lookbehind
assertions to specify efficient matching of fixed-length strings
at the end of subject strings. Consider a simple pattern such as
abcd$
when applied to a long string that does not match. Because
matching proceeds from left to right, PCRE will look for each "a"
in the subject and then see if what follows matches the rest of
the pattern. If the pattern is specified as
^.*abcd$
the initial .* matches the entire string at first, but when this
fails (because there is no following "a"), it backtracks to match
all but the last character, then all but the last two characters,
and so on. Once again the search for "a" covers the entire
string, from right to left, so we are no better off. However, if
the pattern is written as
^.*+(?<=abcd)
there can be no backtracking for the .*+ item; it can match only
the entire string. The subsequent lookbehind assertion does a
single test on the last four characters. If it fails, the match
fails immediately. For long strings, this approach makes a
significant difference to the processing time.
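The possessive form can be tried with the sketch below, which uses Python's re module as a stand-in for PCRE. Two assumptions are made that are not part of the text above: possessive quantifiers are accepted only by Python 3.11 and later, and on older versions the helper (a hypothetical name, compile_tail_check) falls back to emulating the possessive match with a lookahead capture followed by a backreference.

```python
import re


def compile_tail_check(tail="abcd"):
    """Compile a pattern that tests whether a one-line subject ends with tail."""
    literal = re.escape(tail)
    try:
        # Possessive .*+ as in the text; accepted by Python 3.11 and later.
        return re.compile(r"^.*+(?<=" + literal + ")")
    except re.error:
        # Older Python: the lookahead captures the whole line and the
        # backreference consumes it; neither can be backtracked into.
        return re.compile(r"^(?=(.*))\1(?<=" + literal + ")")


tail_check = compile_tail_check()
ends_with_abcd = tail_check.match("x" * 1000 + "abcd") is not None
does_not_end = tail_check.match("abcd" + "x" * 1000) is not None
```

In both variants the whole subject is consumed once and the lookbehind then makes a single test on the last four characters.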
Using multiple assertions
Several assertions (of any sort) may occur in succession. For
example,
(?<=\d{3})(?<!999)foo
matches "foo" preceded by three digits that are not "999". Notice
that each of the assertions is applied independently at the same
point in the subject string. First there is a check that the
previous three characters are all digits, and then there is a
check that the same three characters are not "999". This pattern
does not match "foo" preceded by six characters, the first three of
which are digits and the last three of which are not "999". For
example, it doesn't match "123abcfoo". A pattern to do that is
(?<=\d{3}...)(?<!999)foo
This time the first assertion looks at the preceding six
characters, checking that the first three are digits, and then
the second assertion checks that the preceding three characters
are not "999".
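The difference between the two patterns can be checked with a minimal sketch, using Python's re module as a stand-in for PCRE (an assumption; both engines accept these fixed-length lookbehinds). The variable names and test subjects are invented for illustration.

```python
import re

# Both assertions inspect the same three preceding characters.
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion inspects six preceding characters, the second the last three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]
same_three_hits = [s for s in subjects if same_three.search(s)]
six_back_hits = [s for s in subjects if six_back.search(s)]
```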
Assertions can be nested in any combination. For example,
(?<=(?<!foo)bar)baz
matches an occurrence of "baz" that is preceded by "bar" which in
turn is not preceded by "foo", while
(?<=\d{3}(?!999)...)foo
is another pattern that matches "foo" preceded by three digits
and any three characters that are not "999".
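The nested forms can be exercised in the same way. The sketch below uses Python's re module as a stand-in for PCRE (an assumption; it is believed to handle these particular nestings the same way), with subjects made up for illustration.

```python
import re

# "baz" preceded by "bar", which in turn is not preceded by "foo"
nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters that are not "999"
ahead_in_behind = re.compile(r"(?<=\d{3}(?!999)...)foo")

bar_without_foo = nested_behind.search("xxxbarbaz") is not None
bar_with_foo = nested_behind.search("foobarbaz") is not None
digits_then_abc = ahead_in_behind.search("123abcfoo") is not None
digits_then_999 = ahead_in_behind.search("123999foo") is not None
```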