Perl 5.10 introduced a number of "Special Backtracking Control
Verbs", which are still described in the Perl documentation as
"experimental and subject to change or removal in a future
version of Perl". It goes on to say: "Their usage in production
code should be noted to avoid problems during upgrades." The same
remarks apply to the PCRE features described in this section.
The new verbs make use of what was previously invalid syntax: an
opening parenthesis followed by an asterisk. They are generally
of the form (*VERB) or (*VERB:NAME). Some may take either form,
possibly behaving differently depending on whether or not a name
is present. A name is any sequence of characters that does not
include a closing parenthesis. The maximum length of a name is 255
in the 8-bit library and 65535 in the 16-bit and 32-bit
libraries. If the name is empty, that is, if the closing
parenthesis immediately follows the colon, the effect is as if
the colon were not there. Any number of these verbs may occur in
a pattern.
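The naming rules above can be sketched as a small scanner. This is an illustrative sketch in Python, not part of PCRE: the names VERB_RE, MAX_NAME_8BIT and find_verbs are hypothetical, the scanner does not check that a verb is one PCRE recognizes, it does not handle the (*:NAME) shorthand, and it ignores escapes and character classes, so it can misread a pattern such as [(*A)].

```python
import re

# Hypothetical helper: "(*" + VERB + optional ":NAME" + ")".
# A name is any run of characters that contains no closing parenthesis.
VERB_RE = re.compile(r"\(\*([A-Z_]+)(?::([^)]*))?\)")

# Limit quoted in the text for the 8-bit library.
MAX_NAME_8BIT = 255


def find_verbs(pattern):
    """Return (verb, name) pairs; name is None when absent or empty."""
    verbs = []
    for m in VERB_RE.finditer(pattern):
        verb, name = m.group(1), m.group(2)
        if name == "":
            # An empty name behaves as if the colon were not there.
            name = None
        if name is not None and len(name) > MAX_NAME_8BIT:
            raise ValueError("verb name longer than %d characters" % MAX_NAME_8BIT)
        verbs.append((verb, name))
    return verbs
```

For example, find_verbs("X(*MARK:A)Y|X(*PRUNE:)Z") reports (*PRUNE:) with no name, which mirrors the rule that an empty name is the same as no name.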
Since these verbs are specifically related to backtracking, most
of them can be used only when the pattern is to be matched using
one of the traditional matching functions, because these use a
backtracking algorithm. With the exception of (*FAIL), which
behaves like a failing negative assertion, the backtracking
control verbs cause an error if encountered by a DFA matching
function.
The behaviour of these verbs in repeated groups, assertions, and
in subpatterns called as subroutines (whether or not recursively)
is documented below.
Optimizations that affect backtracking verbs
PCRE contains some optimizations that are used to speed up
matching by running some checks at the start of each match
attempt. For example, it may know the minimum length of matching
subject, or that a particular character must be present. When one
of these optimizations bypasses the running of a match, any
included backtracking verbs will not, of course, be processed.
You can suppress the start-of-match optimizations by setting the
PCRE_NO_START_OPTIMIZE option when calling pcre_compile() or
pcre_exec(), or by starting the pattern with (*NO_START_OPT).
There is more discussion of this option in the section entitled
"Option bits for pcre_exec()" in the pcreapi documentation.
Experiments with Perl suggest that it too has similar
optimizations, sometimes leading to anomalous results.
Verbs that act immediately
The following verbs act as soon as they are encountered. They may
not be followed by a name.
(*ACCEPT)
This verb causes the match to end successfully, skipping the
remainder of the pattern. However, when it is inside a subpattern
that is called as a subroutine, only that subpattern is ended
successfully. Matching then continues at the outer level. If
(*ACCEPT) is triggered in a positive assertion, the assertion
succeeds; in a negative assertion, the assertion fails.
If (*ACCEPT) is inside capturing parentheses, the data so far is
captured. For example:
A((?:A|B(*ACCEPT)|C)D)
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is
captured by the outer parentheses.
(*FAIL) or (*F)
This verb causes a matching failure, forcing backtracking to
occur. It is equivalent to (?!) but easier to read. The Perl
documentation notes that it is probably useful only when combined
with (?{}) or (??{}). Those are, of course, Perl features that
are not present in PCRE. The nearest equivalent is the callout
feature, as for example in this pattern:
a+(?C)(*FAIL)
A match with the string "aaaa" always fails, but the callout is
taken before each backtrack happens (in this example, 10 times).
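The figure of 10 can be checked by counting how often a plain backtracking search reaches the callout. The following Python sketch is a model only, not PCRE code: count_callouts is a hypothetical name, and it assumes that no start-of-match optimization bypasses any match attempt.

```python
def count_callouts(subject):
    """Count how often (?C) is reached for a+(?C)(*FAIL), unanchored."""
    callouts = 0
    # An unanchored search tries every starting point in turn.
    for start in range(len(subject) + 1):
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run of "a"s, then gives back one
        # character at a time. Each of those positions reaches (?C)
        # before (*FAIL) forces the next backtrack.
        callouts += run
    return callouts
```

For "aaaa" the four starting points that begin with "a" contribute 4, 3, 2 and 1 callouts respectively.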
Recording which path was taken
There is one verb whose main purpose is to track how a match was
arrived at, though it also has a secondary use in conjunction
with advancing the match starting point (see (*SKIP) below).
(*MARK:NAME) or (*:NAME)
A name is always required with this verb. There may be as many
instances of (*MARK) as you like in a pattern, and their names do
not have to be unique.
When a match succeeds, the name of the last-encountered
(*MARK:NAME), (*PRUNE:NAME), or (*THEN:NAME) on the matching path
is passed back to the caller as described in the section entitled
"Extra data for pcre_exec()" in the pcreapi documentation. Here
is an example of pcretest output, where the /K modifier requests
the retrieval and outputting of (*MARK) data:
re> /X(*MARK:A)Y|X(*MARK:B)Z/K
data> XY
0: XY
MK: A
data> XZ
0: XZ
MK: B
The (*MARK) name is tagged with "MK:" in this output, and in this
example it indicates which of the two alternatives matched. This
is a more efficient way of obtaining this information than
putting each alternative in its own capturing parentheses.
If a verb with a name is encountered in a positive assertion that
is true, the name is recorded and passed back if it is the last-
encountered. This does not happen for negative assertions or
failing positive assertions.
After a partial match or a failed match, the last encountered
name in the entire match process is returned. For example:
re> /X(*MARK:A)Y|X(*MARK:B)Z/K
data> XP
No match, mark = B
Note that in this unanchored example the mark is retained from
the match attempt that started at the letter "X" in the subject.
Subsequent match attempts starting at "P" and then with an empty
string do not get as far as the (*MARK) item, but nevertheless do
not reset it.
If you are interested in (*MARK) values after failed matches, you
should probably set the PCRE_NO_START_OPTIMIZE option (see above)
to ensure that the match is always attempted.
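The mark results in the two pcretest examples above can be reproduced with a small model. This Python sketch is an assumption-laden illustration, not the PCRE implementation: search_with_mark and ALTS are hypothetical names, each alternative is reduced to a literal prefix, a mark name and a literal suffix, and no start-of-match optimization is modelled.

```python
def search_with_mark(subject, alternatives):
    """alternatives: list of (prefix, mark_name, suffix) literal triples."""
    last_mark = None
    for start in range(len(subject) + 1):
        for prefix, name, suffix in alternatives:
            if not subject.startswith(prefix, start):
                continue              # (*MARK) is never reached
            last_mark = name          # (*MARK:NAME) has been passed
            end = start + len(prefix)
            if subject.startswith(suffix, end):
                return subject[start:end + len(suffix)], last_mark
            # Suffix failed: the mark is kept, the next alternative is tried.
    # No match: the last mark seen in the whole process is still reported.
    return None, last_mark


# Model of /X(*MARK:A)Y|X(*MARK:B)Z/
ALTS = [("X", "A", "Y"), ("X", "B", "Z")]
```

With these definitions, search_with_mark("XP", ALTS) reports no match together with the mark B, because later attempts that never reach a (*MARK) do not reset the value.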
Verbs that act after backtracking
The following verbs do nothing when they are encountered.
Matching continues with what follows, but if there is no
subsequent match, causing a backtrack to the verb, a failure is
forced. That is, backtracking cannot pass to the left of the
verb. However, when one of these verbs appears inside an atomic
group or an assertion that is true, its effect is confined to
that group, because once the group has been matched, there is
never any backtracking into it. In this situation, backtracking
can "jump back" to the left of the entire atomic group or
assertion. (Remember, as stated above, that this localization
also applies in subroutine calls.)
These verbs differ in exactly what kind of failure occurs when
backtracking reaches them. The behaviour described below is what
happens when the verb is not in a subroutine or an assertion.
Subsequent sections cover these special cases.
(*COMMIT)
This verb, which may not be followed by a name, causes the whole
match to fail outright if there is a later matching failure that
causes backtracking to reach it. Even if the pattern is
unanchored, no further attempts to find a match by advancing the
starting point take place. If (*COMMIT) is the only backtracking
verb that is encountered, once it has been passed pcre_exec() is
committed to finding a match at the current starting point, or
not at all. For example:
a+(*COMMIT)b
This matches "xxaab" but not "aacaab". It can be thought of as a
kind of dynamic anchor, or "I've started, so I must finish." The
name of the most recently passed (*MARK) in the path is passed
back when (*COMMIT) forces a match failure.
If there is more than one backtracking verb in a pattern, a
different one that follows (*COMMIT) may be triggered first, so
merely passing (*COMMIT) during a match does not always guarantee
that a match must be at this starting point.
Note that (*COMMIT) at the start of a pattern is not the same as
an anchor, unless PCRE's start-of-match optimizations are turned
off, as shown in this output from pcretest:
re> /(*COMMIT)abc/
data> xyzabc
0: abc
data> xyzabc\Y
No match
For this pattern, PCRE knows that any match must start with "a",
so the optimization skips along the subject to "a" before
applying the pattern to the first set of data. The match attempt
then succeeds. In the second set of data, the escape sequence \Y
is interpreted by the pcretest program. It causes the
PCRE_NO_START_OPTIMIZE option to be set when pcre_exec() is
called. This disables the optimization that skips along to the
first character. The pattern is now applied starting at "x", and
so the (*COMMIT) causes the match to fail without trying any
other starting points.
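The difference between the two runs can be modelled in a few lines. This Python sketch is a simplification under stated assumptions, not PCRE code: search_commit_abc is a hypothetical name, and the only optimization modelled is skipping ahead to the first "a" in the subject.

```python
def search_commit_abc(subject, start_optimize=True):
    """Model an unanchored search for (*COMMIT)abc."""
    if start_optimize:
        # Assumed optimization: skip to the first possible starting "a".
        start = subject.find("a")
        if start < 0:
            return None               # no "a" at all, no attempt is made
    else:
        start = 0                     # the pattern is applied at offset 0
    # (*COMMIT) is passed immediately, so this one starting point is the
    # only one tried: either "abc" is here or the whole match fails.
    if subject.startswith("abc", start):
        return (start, start + 3)
    return None
```

Calling search_commit_abc("xyzabc") returns the offsets of "abc", while search_commit_abc("xyzabc", start_optimize=False) returns None, matching the two results in the pcretest output.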
(*PRUNE) or (*PRUNE:NAME)
This verb causes the match to fail at the current starting
position in the subject if there is a later matching failure that
causes backtracking to reach it. If the pattern is unanchored,
the normal "bumpalong" advance to the next starting character
then happens. Backtracking can occur as usual to the left of
(*PRUNE), before it is reached, or when matching to the right of
(*PRUNE), but if there is no match to the right, backtracking
cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) is
just an alternative to an atomic group or possessive quantifier,
but there are some uses of (*PRUNE) that cannot be expressed in
any other way. In an anchored pattern (*PRUNE) has the same
effect as (*COMMIT).
The behaviour of (*PRUNE:NAME) is not the same as
(*MARK:NAME)(*PRUNE). It is like (*MARK:NAME) in that the name
is remembered for passing back to the caller. However,
(*SKIP:NAME) searches only for names set with (*MARK).
(*SKIP)
This verb, when given without a name, is like (*PRUNE), except
that if the pattern is unanchored, the "bumpalong" advance is not
to the next character, but to the position in the subject where
(*SKIP) was encountered. (*SKIP) signifies that whatever text was
matched leading up to it cannot be part of a successful match.
Consider:
a+(*SKIP)b
If the subject is "aaaac...", after the first match attempt fails
(starting at the first character in the string), the starting
point skips on to start the next attempt at "c". Note that a
possessive quantifier does not have the same effect as this
example; although it would suppress backtracking during the first
match attempt, the second attempt would start at the second
character instead of skipping on to "c".
(*SKIP:NAME)
When (*SKIP) has an associated name, its behaviour is modified.
When it is triggered, the previous path through the pattern is
searched for the most recent (*MARK) that has the same name. If
one is found, the "bumpalong" advance is to the subject position
that corresponds to that (*MARK) instead of to where (*SKIP) was
encountered. If no (*MARK) with a matching name is found, the
(*SKIP) is ignored.
Note that (*SKIP:NAME) searches only for names set by
(*MARK:NAME). It ignores names that are set by (*PRUNE:NAME) or
(*THEN:NAME).
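The lookup performed by (*SKIP:NAME) can be sketched as a search backwards through the marks passed on the current path. This Python fragment is illustrative only: skip_target is a hypothetical name, and the list is assumed to contain only names set by (*MARK:NAME), since names set by other verbs are ignored.

```python
def skip_target(marks_on_path, name):
    """marks_on_path: (mark_name, subject_position) pairs, in the order passed.

    Returns the subject position for the "bumpalong" advance, or None
    when no mark of that name exists and (*SKIP:NAME) is ignored.
    """
    # The most recent mark with the same name wins.
    for mark_name, position in reversed(marks_on_path):
        if mark_name == name:
            return position
    return None
```

For a path that passed marks A at offset 1, B at offset 3 and A again at offset 5, a (*SKIP:A) would advance to offset 5, and a (*SKIP:C) would be ignored.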
(*THEN) or (*THEN:NAME)
This verb causes a skip to the next innermost alternative when
backtracking reaches it. That is, it cancels any further
backtracking within the current alternative. Its name comes from
the observation that it can be used for a pattern-based if-then-
else block:
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ )
...
If the COND1 pattern matches, FOO is tried (and possibly further
items after the end of the group if FOO succeeds); on failure,
the matcher skips to the second alternative and tries COND2,
without backtracking into COND1. If that succeeds and BAR fails,
COND3 is tried. If subsequently BAZ fails, there are no more
alternatives, so there is a backtrack to whatever came before the
entire group. If (*THEN) is not inside an alternation, it acts
like (*PRUNE).
The behaviour of (*THEN:NAME) is not the same as
(*MARK:NAME)(*THEN). It is like (*MARK:NAME) in that the name is
remembered for passing back to the caller. However, (*SKIP:NAME)
searches only for names set with (*MARK).
A subpattern that does not contain a | character is just a part
of the enclosing alternative; it is not a nested alternation with
only one alternative. The effect of (*THEN) extends beyond such a
subpattern to the enclosing alternative. Consider this pattern,
where A, B, etc. are complex pattern fragments that do not
contain any | characters at this level:
A (B(*THEN)C) | D
If A and B are matched, but there is a failure in C, matching
does not backtrack into A; instead it moves to the next
alternative, that is, D. However, if the subpattern containing
(*THEN) is given an alternative, it behaves differently:
A (B(*THEN)C | (*FAIL)) | D
The effect of (*THEN) is now confined to the inner subpattern.
After a failure in C, matching moves to (*FAIL), which causes the
whole subpattern to fail because there are no more alternatives
to try. In this case, matching does now backtrack into A.
Note that a conditional subpattern is not considered as having
two alternatives, because only one is ever used. In other words,
the | character in a conditional subpattern has a different
meaning. Ignoring white space, consider:
^.*? (?(?=a) a | b(*THEN)c )
If the subject is "ba", this pattern does not match. Because .*?
is ungreedy, it initially matches zero characters. The condition
(?=a) then fails, the character "b" is matched, but "c" is not.
At this point, matching does not backtrack to .*? as might
perhaps be expected from the presence of the | character. The
conditional subpattern is part of the single alternative that
comprises the whole pattern, and so the match fails. (If there
were a backtrack into .*?, allowing it to match "b", the match
would succeed.)
The verbs just described provide four different "strengths" of
control when subsequent matching fails. (*THEN) is the weakest,
carrying on the match at the next alternative. (*PRUNE) comes
next, failing the match at the current starting position, but
allowing an advance to the next character (for an unanchored
pattern). (*SKIP) is similar, except that the advance may be more
than one character. (*COMMIT) is the strongest, causing the
entire match to fail.
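The effect of the three stronger verbs on the choice of starting points can be seen in a model of the earlier a+(*VERB)b examples. This Python sketch is not how PCRE is implemented: search is a hypothetical name, start-of-match optimizations are assumed to be off, and (*THEN) is omitted because the pattern has no alternation.

```python
def search(subject, verb=None):
    """Model an unanchored search for a+(*VERB)b.

    verb is None, "PRUNE", "SKIP" or "COMMIT".
    Returns (match_span_or_None, starting_points_tried).
    """
    starts = []
    start = 0
    while start <= len(subject):
        starts.append(start)
        # a+ is greedy: take the longest run of "a" at this start.
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        next_start = start + 1        # the normal "bumpalong" advance
        if end > start:               # a+ matched, so the verb is passed
            # Without a verb, a+ gives back one character at a time.
            # With a verb, backtracking cannot cross it.
            stops = range(end, start, -1) if verb is None else [end]
            for pos in stops:
                if subject[pos:pos + 1] == "b":
                    return (start, pos + 1), starts
            if verb == "COMMIT":
                return None, starts   # no further starting points
            if verb == "SKIP":
                # Advance to where (*SKIP) was encountered.
                next_start = max(end, start + 1)
        start = next_start
    return None, starts
```

On the subject "aaaacab" the model tries starting points 0, 4 and 5 with SKIP, all of 0 to 5 with PRUNE, and only 0 with COMMIT, which then fails outright.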
More than one backtracking verb
If more than one backtracking verb is present in a pattern, the
one that is backtracked onto first acts. For example, consider
this pattern, where A, B, etc. are complex pattern fragments:
(A(*COMMIT)B(*THEN)C|ABD)
If A matches but B fails, the backtrack to (*COMMIT) causes the
entire match to fail. However, if A and B match, but C fails, the
backtrack to (*THEN) causes the next alternative (ABD) to be
tried. This behaviour is consistent, but is not always the same
as Perl's. It means that if two or more backtracking verbs appear
in succession, all but the last of them have no effect. Consider
this example:
...(*COMMIT)(*PRUNE)...
If there is a matching failure to the right, backtracking onto
(*PRUNE) causes it to be triggered, and its action is taken.
There can never be a backtrack onto (*COMMIT).
Backtracking verbs in repeated groups
PCRE differs from Perl in its handling of backtracking verbs in
repeated groups. For example, consider:
/(a(*COMMIT)b)+ac/
If the subject is "abac", Perl matches, but PCRE fails because
the (*COMMIT) in the second repeat of the group acts.
Backtracking verbs in assertions
(*FAIL) in an assertion has its normal effect: it forces an
immediate backtrack.
(*ACCEPT) in a positive assertion causes the assertion to succeed
without any further processing. In a negative assertion,
(*ACCEPT) causes the assertion to fail without any further
processing.
The other backtracking verbs are not treated specially if they
appear in a positive assertion. In particular, (*THEN) skips to
the next alternative in the innermost enclosing group that has
alternations, whether or not this is within the assertion.
Negative assertions are, however, different, in order to ensure
that changing a positive assertion into a negative assertion
changes its result. Backtracking into (*COMMIT), (*SKIP), or
(*PRUNE) causes a negative assertion to be true, without
considering any further alternative branches in the assertion.
Backtracking into (*THEN) causes it to skip to the next enclosing
alternative within the assertion (the normal behaviour), but if
the assertion does not have such an alternative, (*THEN) behaves
like (*PRUNE).
Backtracking verbs in subroutines
These behaviours occur whether or not the subpattern is called
recursively. Perl's treatment of subroutines is different in
some cases.
(*FAIL) in a subpattern called as a subroutine has its normal
effect: it forces an immediate backtrack.
(*ACCEPT) in a subpattern called as a subroutine causes the
subroutine match to succeed without any further processing.
Matching then continues after the subroutine call.
(*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a
subroutine cause the subroutine match to fail.
(*THEN) skips to the next alternative in the innermost enclosing
group within the subpattern that has alternatives. If there is no
such group within the subpattern, (*THEN) causes the subroutine
match to fail.