Outside a character class, a backslash followed by a digit
greater than 0 (and possibly further digits) is a back reference
to a capturing subpattern earlier (that is, to its left) in the
pattern, provided there have been that many previous capturing
left parentheses.
However, if the decimal number following the backslash is less
than 10, it is always taken as a back reference, and causes an
error only if there are not that many capturing left parentheses
in the entire pattern. In other words, the parentheses that are
referenced need not be to the left of the reference for numbers
less than 10. A "forward back reference" of this type can make
sense when a repetition is involved and the subpattern to the
right has participated in an earlier iteration.
It is not possible to have a numerical "forward back reference"
to a subpattern whose number is 10 or more using this syntax
because a sequence such as \50 is interpreted as a character
defined in octal. See the subsection entitled "Non-printing
characters" above for further details of the handling of digits
following a backslash. There is no such problem when named
parentheses are used. A back reference to any subpattern is
possible using named parentheses (see below).
Another way of avoiding the ambiguity inherent in the use of
digits following a backslash is to use the \g escape sequence.
This escape must be followed by an unsigned number or a negative
number, optionally enclosed in braces. These examples are all
equivalent:
(ring), \1
(ring), \g1
(ring), \g{1}
An unsigned number specifies an absolute reference without the
ambiguity that is present in the older syntax. It is also useful
when literal digits follow the reference. A negative number is a
relative reference. Consider this example:
(abc(def)ghi)\g{-1}
The sequence \g{-1} is a reference to the most recently started
capturing subpattern before \g, that is, it is equivalent to \2
in this example. Similarly, \g{-2} would be equivalent to \1.
The use of relative references can be helpful in long patterns,
and also in patterns that are created by joining together
fragments that contain references within themselves.
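The arithmetic behind a relative reference can be sketched as a small
rewriting step. The function below, resolve_relative_refs, is a
hypothetical helper written in Python for illustration only; it is not
part of PCRE. It counts the capturing left parentheses seen so far and
turns each \g{-n} into the equivalent absolute \g{m}. It assumes the
pattern has no escaped parentheses, no parentheses inside character
classes, and no named groups.

```python
import re

def resolve_relative_refs(pattern):
    """Rewrite each \\g{-n} as the equivalent absolute \\g{m}.

    Hypothetical helper for illustration. Assumes no escaped
    parentheses, no parentheses in character classes, and no named
    groups in the pattern.
    """
    out = []
    opened = 0  # capturing groups started so far
    i = 0
    while i < len(pattern):
        rel = re.match(r"\\g\{-(\d+)\}", pattern[i:])
        if rel:
            n = int(rel.group(1))
            absolute = opened - n + 1  # -1 is the most recently started group
            if absolute < 1:
                raise ValueError("relative reference to a non-existent group")
            out.append("\\g{%d}" % absolute)
            i += rel.end()
            continue
        # A "(" not followed by "?" starts a capturing group.
        if pattern[i] == "(" and pattern[i + 1:i + 2] != "?":
            opened += 1
        out.append(pattern[i])
        i += 1
    return "".join(out)

print(resolve_relative_refs(r"(abc(def)ghi)\g{-1}"))
print(resolve_relative_refs(r"(abc(def)ghi)\g{-2}"))
```

The rewritten patterns are only printed, not compiled, because
Python's own re module does not accept \g inside a pattern.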
A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than anything
matching the subpattern itself (see "Subpatterns as subroutines"
below for a way of doing that). So the pattern
(sens|respons)e and \1ibility
matches "sense and sensibility" and "response and
responsibility", but not "sense and responsibility". If caseful
matching is in force at the time of the back reference, the case
of letters is relevant. For example,
((?i)rah)\s+\1
matches "rah rah" and "RAH RAH", but not "RAH rah", even though
the original capturing subpattern is matched caselessly.
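Both examples can be checked with a short script. The following is a
minimal sketch that uses Python's re module in place of PCRE, on the
assumption that the two engines agree for these particular patterns.
Recent Python versions do not accept (?i) in the middle of a pattern,
so the scoped form (?i:rah) is substituted for it. The helper name
matches is hypothetical.

```python
import re

# Python's re stands in for PCRE here (an assumption of this sketch).
literal = re.compile(r"(sens|respons)e and \1ibility")

# (?i:rah) replaces the inline (?i) of the original pattern. The back
# reference lies outside the caseless scope, so it compares casefully.
caseful = re.compile(r"((?i:rah))\s+\1")

def matches(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None

for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject, "->", matches(literal, subject))

for subject in ("rah rah", "RAH RAH", "RAH rah"):
    print(subject, "->", matches(caseful, subject))
```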
There are several different ways of writing back references to
named subpatterns. The .NET syntax \k{name} and the Perl syntax
\k<name> or \k'name' are supported, as is the Python syntax
(?P=name). Perl 5.10's unified back reference syntax, in which \g
can be used for both numeric and named references, is also
supported. We could rewrite the above example in any of the
following ways:
(?<p1>(?i)rah)\s+\k<p1>
(?'p1'(?i)rah)\s+\k{p1}
(?P<p1>(?i)rah)\s+(?P=p1)
(?<p1>(?i)rah)\s+\g{p1}
A subpattern that is referenced by name may appear in the pattern
before or after the reference.
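As a minimal sketch, the Python-style form can be tried directly in
Python's re module, which is used here in place of PCRE. Only that
one syntax is shown, because it is the only one of the four that re
accepts, and re requires the named group to precede the reference.
The scoped form (?i:rah) stands in for the inline (?i).

```python
import re

# Named group p1, referenced later with the Python syntax (?P=p1).
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

for subject in ("rah rah", "RAH RAH", "RAH rah"):
    m = named.fullmatch(subject)
    # Show what the named group captured, or None if there is no match.
    print(subject, "->", m.group("p1") if m else None)
```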
There may be more than one back reference to the same subpattern.
If a subpattern has not actually been used in a particular match,
any back references to it always fail by default. For example,
the pattern
(a|(bc))\2
always fails if it starts to match "a" rather than "bc". However,
if the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a
back reference to an unset value matches an empty string.
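The default behaviour can be observed with a short sketch. It uses
Python's re module rather than PCRE, on the assumption that re treats
a reference to an unset group the same way as the PCRE default (the
reference fails). The PCRE_JAVASCRIPT_COMPAT behaviour is not
reproduced here.

```python
import re

unset = re.compile(r"(a|(bc))\2")

# Group 2 takes part in the match, so \2 can match the second "bc".
print(unset.fullmatch("bcbc") is not None)

# The first alternative matches "a" and leaves group 2 unset, so \2
# fails at every starting position.
print(unset.search("aa") is not None)
```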
Because there may be many capturing parentheses in a pattern, all
digits following a backslash are taken as part of a potential
back reference number. If the pattern continues with a digit
character, some delimiter must be used to terminate the back
reference. If the PCRE_EXTENDED option is set, this can be white
space. Otherwise, the \g{ syntax or an empty comment (see
"Comments" below) can be used.
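The following minimal sketch shows two of these delimiters, using
Python's re module in place of PCRE. It assumes that re.VERBOSE plays
the role of PCRE_EXTENDED and that the empty comment (?#) behaves the
same way in both engines. The \g{ form is not shown because re does
not accept it in a pattern.

```python
import re

# Intended meaning: group 1, a back reference to it, then a literal "1".
with_comment = re.compile(r"(a)\1(?#)1")         # empty comment as delimiter
with_space = re.compile(r"(a)\1 1", re.VERBOSE)  # white space as delimiter

print(with_comment.fullmatch("aa1") is not None)
print(with_space.fullmatch("aa1") is not None)

# Without a delimiter the digits run together. Python's re reads \11 as
# a reference to group 11, which does not exist in this pattern.
try:
    re.compile(r"(a)\11")
    ambiguous_rejected = False
except re.error:
    ambiguous_rejected = True
print(ambiguous_rejected)
```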
Recursive back references
A back reference that occurs inside the parentheses to which it
refers fails when the subpattern is first used, so, for example,
(a\1) never matches. However, such references can be useful
inside repeated subpatterns. For example, the pattern
(a|b\1)+
matches any number of "a"s and also "aba", "ababbaa" etc. At each
iteration of the subpattern, the back reference matches the
character string corresponding to the previous iteration. In
order for this to work, the pattern must be such that the first
iteration does not need to match the back reference. This can be
done using alternation, as in the example above, or by a
quantifier with a minimum of zero.
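The iteration rule can be made concrete with a hand-written emulation.
Python's re module rejects a reference to a group that is still open,
so the sketch below does not use a regular expression at all. The
function match_recursive is a hypothetical helper that mimics a
whole-subject match of (a|b\1)+, with the variable previous playing
the role of \1.

```python
def match_recursive(subject):
    """Emulate a whole-subject match of (a|b\\1)+ by hand.

    Hypothetical helper: `previous` holds the text matched by the
    previous iteration of the group, as \\1 would.
    """
    previous = None  # \1 is unset before the first iteration
    pos = 0
    while pos < len(subject):
        if subject[pos] == "a":
            # First alternative: a plain "a".
            current = "a"
        elif (subject[pos] == "b" and previous is not None
              and subject.startswith(previous, pos + 1)):
            # Second alternative: "b" followed by the previous iteration.
            current = "b" + previous
        else:
            return False
        pos += len(current)
        previous = current
    # The + quantifier needs at least one iteration.
    return previous is not None

for subject in ("aaa", "aba", "ababbaa", "b", "abb"):
    print(subject, "->", match_recursive(subject))
```

The subject "b" is rejected because the first iteration would need
the back reference, which is exactly the case the alternation is
there to avoid.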
Back references of this type cause the group that they reference
to be treated as an atomic group. Once the whole group has been
matched, a subsequent matching failure cannot cause backtracking
into the middle of the group.