Perl-совместимые регулярные выражения (Perl-compatible regular expressions)
RECURSIVE PATTERNS
Consider the problem of matching a string in parentheses,
allowing for unlimited nested parentheses. Without the use of
recursion, the best that can be done is to use a pattern that
matches up to some fixed depth of nesting. It is not possible to
handle an arbitrary nesting depth.
For some time, Perl has provided a facility that allows regular
expressions to recurse (amongst other things). It does this by
interpolating Perl code in the expression at run time, and the
code can refer to the expression itself. A Perl pattern using
code interpolation to solve the parentheses problem can be
created like this:
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
The (?p{...}) item interpolates Perl code at run time, and in
this case refers recursively to the pattern in which it appears.
Obviously, PCRE cannot support the interpolation of Perl code.
Instead, it supports special syntax for recursion of the entire
pattern, and also for individual subpattern recursion. After its
introduction in PCRE and Python, this kind of recursion was
subsequently introduced into Perl at release 5.10.
A special item that consists of (? followed by a number greater
than zero and a closing parenthesis is a recursive subroutine
call of the subpattern of the given number, provided that it
occurs inside that subpattern. (If not, it is a non-recursive
subroutine call, which is described in the next section.) The
special item (?R) or (?0) is a recursive call of the entire
regular expression.
This PCRE pattern solves the nested parentheses problem (assume
the PCRE_EXTENDED option is set so that white space is ignored):
\( ( [^()]++ | (?R) )* \)
First it matches an opening parenthesis. Then it matches any
number of substrings which can either be a sequence of non-
parentheses, or a recursive match of the pattern itself (that is,
a correctly parenthesized substring). Finally there is a closing
parenthesis. Note the use of a possessive quantifier to avoid
backtracking into sequences of non-parentheses.
If this were part of a larger pattern, you would not want to
recurse the entire pattern, so instead you could use this:
( \( ( [^()]++ | (?1) )* \) )
We have put the pattern into parentheses, and caused the
recursion to refer to them instead of the whole pattern.
In a larger pattern, keeping track of parenthesis numbers can be
tricky. This is made easier by the use of relative references.
Instead of (?1) in the pattern above you can write (?-2) to refer
to the second most recently opened parentheses preceding the
recursion. In other words, a negative number counts capturing
parentheses leftwards from the point at which it is encountered.
It is also possible to refer to subsequently opened parentheses,
by writing references such as (?+2). However, these cannot be
recursive because the reference is not inside the parentheses
that are referenced. They are always non-recursive subroutine
calls, as described in the next section.
An alternative approach is to use named parentheses instead. The
Perl syntax for this is (?&name); PCRE's earlier syntax (?P>name)
is also supported. We could rewrite the above example as follows:
(?<pn> \( ( [^()]++ | (?&pn) )* \) )
If there is more than one subpattern with the same name, the
earliest one is used.
This particular example pattern that we have been looking at
contains nested unlimited repeats, and so the use of a possessive
quantifier for matching strings of non-parentheses is important
when applying the pattern to strings that do not match. For
example, when this pattern is applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
it yields "no match" quickly. However, if a possessive quantifier
is not used, the match runs for a very long time indeed because
there are so many different ways the + and * repeats can carve up
the subject, and all have to be tested before failure can be
reported.
At the end of a match, the values of capturing parentheses are
those from the outermost level. If you want to obtain
intermediate values, a callout function can be used (see below
and the pcrecallout
documentation). If the pattern above is
matched against
(ab(cd)ef)
the value for the inner capturing parentheses (numbered 2) is
"ef", which is the last value taken on at the top level. If a
capturing subpattern is not matched at the top level, its final
captured value is unset, even if it was (temporarily) set at a
deeper level during the matching process.
If there are more than 15 capturing parentheses in a pattern,
PCRE has to obtain extra memory to store data during a recursion,
which it does by using pcre_malloc
, freeing it via pcre_free
afterwards. If no memory can be obtained, the match fails with
the PCRE_ERROR_NOMEMORY error.
Do not confuse the (?R) item with the condition (R), which tests
for recursion. Consider this pattern, which matches text in
angle brackets, allowing for arbitrary nesting. Only digits are
allowed in nested brackets (that is, when recursing), whereas any
characters are permitted at the outer level.
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
In this pattern, (?(R) is the start of a conditional subpattern,
with two different alternatives for the recursive and non-
recursive cases. The (?R) item is the actual recursive call.
Differences in recursion processing between PCRE and Perl
Recursion processing in PCRE differs from Perl in two important
ways. In PCRE (like Python, but unlike Perl), a recursive
subpattern call is always treated as an atomic group. That is,
once it has matched some of the subject string, it is never re-
entered, even if it contains untried alternatives and there is a
subsequent matching failure. This can be illustrated by the
following pattern, which purports to match a palindromic string
that contains an odd number of characters (for example, "a",
"aba", "abcba", "abcdcba"):
^(.|(.)(?1)\2)$
The idea is that it either matches a single character, or two
identical characters surrounding a sub-palindrome. In Perl, this
pattern works; in PCRE it does not if the pattern is longer than
three characters. Consider the subject string "abcba":
At the top level, the first character is matched, but as it is
not at the end of the string, the first alternative fails; the
second alternative is taken and the recursion kicks in. The
recursive call to subpattern 1 successfully matches the next
character ("b"). (Note that the beginning and end of line tests
are not part of the recursion).
Back at the top level, the next character ("c") is compared with
what subpattern 2 matched, which was "a". This fails. Because the
recursion is treated as an atomic group, there are now no
backtracking points, and so the entire match fails. (Perl is
able, at this point, to re-enter the recursion and try the second
alternative.) However, if the pattern is written with the
alternatives in the other order, things are different:
^((.)(?1)\2|.)$
This time, the recursing alternative is tried first, and
continues to recurse until it runs out of characters, at which
point the recursion fails. But this time we do have another
alternative to try at the higher level. That is the big
difference: in the previous case the remaining alternative is at
a deeper recursion level, which PCRE cannot use.
To change the pattern so that it matches all palindromic strings,
not just those with an odd number of characters, it is tempting
to change the pattern to this:
^((.)(?1)\2|.?)$
Again, this works in Perl, but not in PCRE, and for the same
reason. When a deeper recursion has matched a single character,
it cannot be entered again in order to match an empty string. The
solution is to separate the two cases, and write out the odd and
even cases as alternatives at the higher level:
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
If you want to match typical palindromic phrases, the pattern has
to ignore all non-word characters, which can be done like this:
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
If run with the PCRE_CASELESS option, this pattern matches
phrases such as "A man, a plan, a canal: Panama!" and it works
well in both PCRE and Perl. Note the use of the possessive
quantifier *+ to avoid backtracking into sequences of non-word
characters. Without this, PCRE takes a great deal longer (ten
times or more) to match typical phrases, and Perl takes so long
that you think it has gone into a loop.
WARNING
: The palindrome-matching patterns above work only if the
subject string does not start with a palindrome that is shorter
than the entire string. For example, although "abcba" is
correctly matched, if the subject is "ababa", PCRE finds the
palindrome "aba" at the start, then fails at top level because
the end of the string does not follow. Once again, it cannot jump
back into the recursion to try other alternatives, so the entire
match fails.
The second way in which PCRE and Perl differ in their recursion
processing is in the handling of captured values. In Perl, when a
subpattern is called recursively or as a subpattern (see the next
section), it has no access to any values that were captured
outside the recursion, whereas in PCRE these values can be
referenced. Consider this pattern:
^(.)(\1|a(?2))
In PCRE, this pattern matches "bab". The first capturing
parentheses match "b", then in the second group, when the back
reference \1 fails to match "b", the second alternative matches
"a" and then recurses. In the recursion, \1 does now match "b"
and so the whole match succeeds. In Perl, the pattern fails to
match because inside the recursive call \1 cannot access the
externally set value.