It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns,
depending on the result of an assertion, or whether a specific
capturing subpattern has already been matched. The two possible
forms of conditional subpattern are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
If the condition is satisfied, the yes-pattern is used; otherwise
the no-pattern (if present) is used. If there are more than two
alternatives in the subpattern, a compile-time error occurs. Each
of the two alternatives may itself contain nested subpatterns of
any form, including conditional subpatterns; the restriction to
two alternatives applies only at the level of the condition. This
pattern fragment is an example where the alternatives are
complex:
(?(1) (A|B|C) | (D | (?(2)E|F) | E) )
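As a rough illustration of the two-alternative form, the sketch below uses Python's standard re module, which offers a similar (?(id)yes|no) construct. It is not PCRE itself. The pattern and the names BRACKETED_OR_TERMINATED and accepts are invented for this example and do not come from the text above.

```python
import re

# Hypothetical pattern: a lowercase word that is either wrapped in angle
# brackets or, when no opening bracket was captured, ended by a semicolon.
# Group 1 is the optional "<"; the condition (?(1)...) tests whether it matched.
BRACKETED_OR_TERMINATED = re.compile(r"(<)?[a-z]+(?(1)>|;)")


def accepts(subject: str) -> bool:
    """Return True if the whole subject matches the conditional pattern."""
    return BRACKETED_OR_TERMINATED.fullmatch(subject) is not None
```

Under these assumptions, "<abc>" and "abc;" are accepted, while "<abc;" and "abc>" are rejected, because the alternative that is tried depends on whether group 1 took part in the match.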
There are four kinds of condition: references to subpatterns,
references to recursion, a pseudo-condition called DEFINE, and
assertions.
Checking for a used subpattern by number
If the text between the parentheses consists of a sequence of
digits, the condition is true if a capturing subpattern of that
number has previously matched. If there is more than one
capturing subpattern with the same number (see the earlier
section about duplicate subpattern numbers), the condition is
true if any of them have matched. An alternative notation is to
precede the digits with a plus or minus sign. In this case, the
subpattern number is relative rather than absolute. The most
recently opened parentheses can be referenced by (?(-1), the next
most recent by (?(-2), and so on. Inside loops it can also make
sense to refer to subsequent groups. The next parentheses to be
opened can be referenced as (?(+1), and so on. (The value zero in
any of these forms is not used; it provokes a compile-time
error.)
Consider the following pattern, which contains non-significant
white space to make it more readable (assume the PCRE_EXTENDED
option) and to divide it into three parts for ease of discussion:
( \( )? [^()]+ (?(1) \) )
The first part matches an optional opening parenthesis, and if
that character is present, sets it as the first captured
substring. The second part matches one or more characters that
are not parentheses. The third part is a conditional subpattern
that tests whether or not the first set of parentheses matched.
If they did, that is, if the subject started with an opening
parenthesis, the condition is true, and so the yes-pattern is
executed and a closing parenthesis is required. Otherwise, since
no-pattern is not present, the subpattern matches nothing. In
other words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses.
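The behaviour described above can be tried out with a small sketch. It assumes Python's standard re module as a stand-in for PCRE, with re.VERBOSE playing the role of PCRE_EXTENDED. The names OPTIONAL_PARENS and is_optionally_parenthesized are made up for the example.

```python
import re

# Same three parts as in the text: optional "(", one or more
# non-parentheses, and a ")" that is required only if group 1 matched.
# re.VERBOSE makes the white space in the pattern non-significant.
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(subject: str) -> bool:
    """Return True if the whole subject has the form text or (text)."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None
```

With this sketch, "abc" and "(abc)" are accepted, whereas "(abc" and "abc)" are not: an unmatched opening parenthesis leaves the condition true with no closing parenthesis to satisfy it, and a stray closing parenthesis cannot be consumed by any part of the pattern.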
If you were embedding this pattern in a larger one, you could use
a relative reference:
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
This makes the fragment independent of the parentheses in the
larger pattern.
Checking for a used subpattern by name
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for
a used subpattern by name. For compatibility with earlier
versions of PCRE, which had this facility before Perl, the syntax
(?(name)...) is also recognized.
Rewriting the above example to use a named subpattern gives this:
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
If the name used in a condition of this kind is a duplicate, the
test is applied to all subpatterns of the same name, and is true
if any one of them has matched.
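A named condition can be sketched in the same way. Note that the example assumes Python's re module, whose spelling differs from the Perl forms above: groups are named with (?P<name>...) and the condition is written (?(name)...), which corresponds to the older PCRE syntax mentioned in the text. The helper name open_paren_was_captured is hypothetical.

```python
import re
from typing import Optional

# Python spelling of the named example: (?P<OPEN>...) defines the group,
# and (?(OPEN)...) tests whether that group took part in the match.
NAMED_PARENS = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)


def open_paren_was_captured(subject: str) -> Optional[bool]:
    """Return None if there is no match, else whether OPEN matched."""
    match = NAMED_PARENS.fullmatch(subject)
    if match is None:
        return None
    return match.group("OPEN") is not None
```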
Checking for pattern recursion
If the condition is the string (R), and there is no subpattern
with the name R, the condition is true if a recursive call to the
whole pattern or any subpattern has been made. If digits or a
name preceded by ampersand follow the letter R, for example:
(?(R3)...) or (?(R&name)...)
the condition is true if the most recent recursion is into a
subpattern whose number or name is given. This condition does not
check the entire recursion stack. If the name used in a condition
of this kind is a duplicate, the test is applied to all
subpatterns of the same name, and is true if any one of them is
the most recent recursion.
At "top level", all these recursion test conditions are false.
The syntax for recursive patterns is described below.
Defining subpatterns for use by reference only
If the condition is the string (DEFINE), and there is no
subpattern with the name DEFINE, the condition is always false.
In this case, there may be only one alternative in the
subpattern. It is always skipped if control reaches this point in
the pattern; the idea of DEFINE is that it can be used to define
subroutines that can be referenced from elsewhere. (The use of
subroutines is described below.) For example, a pattern to match
an IPv4 address such as "192.168.23.245" could be written like
this (ignore white space and line breaks):
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b
The first part of the pattern is a DEFINE group inside which
another group named "byte" is defined. This matches an individual
component of an IPv4 address (a number less than 256). When
matching takes place, this part of the pattern is skipped because
DEFINE acts like a false condition. The rest of the pattern uses
references to the named group to match the four dot-separated
components of an IPv4 address, insisting on a word boundary at
each end.
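Python's standard re module has neither DEFINE groups nor subroutine calls, so the following sketch swaps in a different technique: it defines the "byte" fragment once as an ordinary string and interpolates it where the PCRE pattern would write (?&byte). The names BYTE, IPV4 and find_ipv4 are assumptions made for the illustration, and the repeated group is non-capturing here, unlike in the pattern above.

```python
import re
from typing import List

# The reusable fragment, defined once (the role DEFINE plays in PCRE).
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four dot-separated components with a word boundary at each end.
# re.ASCII restricts \d and \b to ASCII digits and word characters.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def find_ipv4(text: str) -> List[str]:
    """Return every substring of text that looks like an IPv4 address."""
    return IPV4.findall(text)
```

For instance, find_ipv4("host 192.168.23.245 up") yields the single address in the string, while "256.1.1.1" produces no match because 256 is not a valid component.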
Assertion conditions
If the condition is not in any of the above formats, it must be
an assertion. This may be a positive or negative lookahead or
lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives on the
second line:
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other
words, it tests for the presence of at least one letter in the
subject at or after the current matching position. If a letter is
found, the subject is matched against the first alternative;
otherwise it is matched against the second.
This pattern matches strings in one of the two forms dd-aaa-dd or
dd-dd-dd, where each a is a letter and each d is a digit.
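Engines that lack assertion conditions can usually express the same choice with plain alternation, since (?(?=A)X|Y) behaves like (?=A)X|(?!A)Y. The sketch below applies that rewrite using Python's re module, which does not accept an assertion as a condition. It is an equivalent formulation rather than the PCRE syntax itself, and the names DATE_LIKE and looks_like_date are invented.

```python
import re

# Rewrite of the conditional: the first branch is guarded by the positive
# lookahead, the second by its negation, so exactly one branch is eligible.
DATE_LIKE = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
      | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2}      # no letter lies ahead
    )
    """,
    re.VERBOSE,
)


def looks_like_date(subject: str) -> bool:
    """Return True if the whole subject is dd-aaa-dd or dd-dd-dd."""
    return DATE_LIKE.fullmatch(subject) is not None
```

Here "12-abc-34" and "12-34-56" are accepted, and "12-ab-34" is rejected: the lookahead finds a letter, so only the dd-aaa-dd branch is eligible, and it needs three letters.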