Identifying capturing parentheses by number is simple, but it can
be very hard to keep track of the numbers in complicated regular
expressions. Furthermore, if an expression is modified, the
numbers may change. To help with this difficulty, PCRE supports
the naming of subpatterns. This feature was not added to Perl
until release 5.10. Python had the feature earlier, and PCRE
introduced it at release 4.0, using the Python syntax. PCRE now
supports both the Perl and the Python syntax. Perl allows
identically numbered subpatterns to have different names, but
PCRE does not.
In PCRE, a subpattern can be named in one of three ways:
(?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in
Python. References to capturing parentheses from other parts of
the pattern, such as back references, recursion, and conditions,
can be made by name as well as by number.
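As a rough illustration of naming a subpattern and referring back to it by name, here is a minimal sketch using Python's re module. This is a different engine from PCRE, but it uses the same (?P<name>...) syntax, and the (?P=name) back reference form is also accepted by PCRE. The pattern, the group name "word", and the helper find_doubled are illustrative assumptions, not part of PCRE.

```python
import re

# (?P<word>...) names the capturing group; (?P=word) is a back
# reference to it by name rather than by number.
DOUBLED = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

def find_doubled(text):
    """Return the first immediately repeated word in text, or None."""
    m = DOUBLED.search(text)
    return m.group("word") if m else None

print(find_doubled("it was the the best of times"))
```

Because the reference is by name, adding another capturing group earlier in the pattern would not require the back reference to be renumbered.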
Names consist of up to 32 alphanumeric characters and
underscores, but must start with a non-digit. Named capturing
parentheses are still allocated numbers as well as names, exactly
as if the names were not present. The PCRE API provides function
calls for extracting the name-to-number translation table from a
compiled pattern. There is also a convenience function for
extracting a captured substring by name.
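The relationship between names and numbers can be sketched with Python's re module, used here only as an analogy: its groupindex attribute plays a role similar to the name-to-number table, and group() with a name is similar in spirit to the convenience function. The PCRE functions themselves are described in the pcreapi documentation and are not shown here. The date pattern and variable names are illustrative assumptions.

```python
import re

# Two named groups and one unnamed group. The named groups are
# still numbered 1 and 2, exactly as if they had no names.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Name-to-number translation table for the compiled pattern.
name_to_number = dict(DATE.groupindex)

m = DATE.match("2008-06-30")
by_name = m.group("month")   # extract by name
by_number = m.group(2)       # the same group, extracted by number
day = m.group(3)             # the unnamed group has a number only

print(name_to_number, by_name, by_number, day)
```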
By default, a name must be unique within a pattern, but it is
possible to relax this constraint by setting the PCRE_DUPNAMES
option at compile time. (Duplicate names are also always
permitted for subpatterns with the same number, set up as
described in the previous section.) Duplicate names can be useful
for patterns where only one instance of the named parentheses can
match. Suppose you want to match the name of a weekday, either as
a 3-letter abbreviation or as the full name, and in both cases
you want to extract the abbreviation. This pattern (ignoring the
line breaks) does the job:
  (?<DN>Mon|Fri|Sun)(?:day)?|
  (?<DN>Tue)(?:sday)?|
  (?<DN>Wed)(?:nesday)?|
  (?<DN>Thu)(?:rsday)?|
  (?<DN>Sat)(?:urday)?
There are five capturing subpatterns, but only one is ever set
after a match. (An alternative way of solving this problem is to
use a "branch reset" subpattern, as described in the previous
section.)
The convenience function for extracting the data by name returns
the substring for the first (and in this example, the only)
subpattern of that name that matched. This saves searching to
find which numbered subpattern it was.
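To show the searching that extraction by a duplicated name avoids, here is a hedged sketch in Python. Python's re module rejects duplicate group names, so the five groups are left unnamed and the code has to look through the numbered groups for the one that was set. The function name abbreviation is a hypothetical helper, and the use of a full-string match is an assumption made for the example.

```python
import re

# Same alternatives as the weekday pattern above, with unnamed
# groups because Python's re does not allow a name to be reused.
WEEKDAY = re.compile(
    r"(Mon|Fri|Sun)(?:day)?|"
    r"(Tue)(?:sday)?|"
    r"(Wed)(?:nesday)?|"
    r"(Thu)(?:rsday)?|"
    r"(Sat)(?:urday)?"
)

def abbreviation(text):
    """Return the 3-letter abbreviation for a weekday name, or None."""
    m = WEEKDAY.fullmatch(text)
    if m is None:
        return None
    # Only one of the five groups is set after a match, so search
    # the numbered groups for it.
    return next(g for g in m.groups() if g is not None)

print(abbreviation("Wednesday"), abbreviation("Sat"))
```

With PCRE and duplicate names, the search in the last line of abbreviation is what the convenience function does on the caller's behalf.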
If you make a back reference to a non-unique named subpattern
from elsewhere in the pattern, the subpatterns to which the name
refers are checked in the order in which they appear in the
overall pattern. The first one that is set is used for the
reference. For example, this pattern matches both "foofoo" and
"barbar" but not "foobar" or "barfoo":
  (?:(?<n>foo)|(?<n>bar))\k<n>
If you make a subroutine call to a non-unique named subpattern,
the one that corresponds to the first occurrence of the name is
used. In the absence of duplicate numbers (see the previous
section) this is the one with the lowest number.
If you use a named reference in a condition test (see the section
about conditions below), either to check whether a subpattern has
matched, or to check for recursion, all subpatterns with the same
name are tested. If the condition is true for any one of them,
the overall condition is true. This is the same behaviour as
testing by number. For further details of the interfaces for
handling named subpatterns, see the pcreapi documentation.
Warning: You cannot use different names to distinguish between
two subpatterns with the same number because PCRE uses only the
numbers when matching. For this reason, an error is given at
compile time if different names are given to subpatterns with the
same number. However, you can always give the same name to
subpatterns with the same number, even when PCRE_DUPNAMES is not
set.