Perl-compatible regular expressions
BACKSLASH
The backslash character has several uses. Firstly, if it is
followed by a character that is not a number or a letter, it
takes away any special meaning that character may have. This use
of backslash as an escape character applies both inside and
outside character classes.
For example, if you want to match a * character, you write \* in
the pattern. This escaping action applies whether or not the
following character would otherwise be interpreted as a
metacharacter, so it is always safe to precede a non-alphanumeric
with backslash to specify that it stands for itself. In
particular, if you want to match a backslash, you write \\.
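As an illustration of this rule, the following Python sketch builds
a pattern that matches a string literally by putting a backslash
before every ASCII character that is not a letter or digit. The
name quote_literal is hypothetical. Python's re module is used only
as a stand-in engine that accepts the same backslash escapes for
punctuation; it is not PCRE.

```python
import re

def quote_literal(text):
    """Backslash-escape every ASCII non-alphanumeric character.

    Hypothetical helper. ASCII letters and digits are never escaped,
    because a backslash before them would start an escape sequence.
    Characters above code point 127 are copied unchanged.
    """
    out = []
    for ch in text:
        if ord(ch) < 128 and not ch.isalnum():
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)

sample = r"1+1=2 (maybe?) \ [x]*"
pattern = quote_literal(sample)
# The quoted pattern should match the original text as a literal.
print(re.fullmatch(pattern, sample) is not None)
```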
In a UTF mode, only ASCII numbers and letters have any special
meaning after a backslash. All other characters (in particular,
those whose codepoints are greater than 127) are treated as
literals.
If a pattern is compiled with the PCRE_EXTENDED option, most
white space in the pattern (other than in a character class), and
characters between a # outside a character class and the next
newline, inclusive, are ignored. An escaping backslash can be
used to include a white space or # character as part of the
pattern.
If you want to remove the special meaning from a sequence of
characters, you can do so by putting them between \Q and \E. This
is different from Perl in that $ and @ are handled as literals in
\Q...\E sequences in PCRE, whereas in Perl, $ and @ cause
variable interpolation. Note the following examples:
Pattern            PCRE matches   Perl matches
\Qabc$xyz\E        abc$xyz        abc followed by the
                                  contents of $xyz
\Qabc\$xyz\E       abc\$xyz       abc\$xyz
\Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
The \Q...\E sequence is recognized both inside and outside
character classes. An isolated \E that is not preceded by \Q is
ignored. If \Q is not followed by \E later in the pattern, the
literal interpretation continues to the end of the pattern (that
is, \E is assumed at the end). If the isolated \Q is inside a
character class, this causes an error, because the character
class is not terminated.
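The \Q...\E rules outside a character class can be modelled by a
small rewriter that turns each quoted run into backslash-escaped
literals. This is only a sketch of the behaviour described above,
not PCRE's implementation. The name expand_quoted is hypothetical,
and the error case for an unterminated \Q inside a character class
is not modelled.

```python
def _escape(ch):
    # ASCII letters/digits and non-ASCII characters need no backslash.
    if ord(ch) > 127 or ch.isalnum():
        return ch
    return "\\" + ch

def expand_quoted(pattern):
    """Rewrite \\Q...\\E runs as individually escaped characters."""
    out, i, quoting = [], 0, False
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if pair == "\\E":
            quoting = False      # ends quoting; an isolated \E is dropped
            i += 2
        elif quoting:
            out.append(_escape(pattern[i]))
            i += 1
        elif pair == "\\Q":
            quoting = True       # stays on to the end if no \E follows
            i += 2
        elif pattern[i] == "\\" and len(pair) == 2:
            out.append(pair)     # keep an ordinary escape pair intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

for p in (r"\Qabc$xyz\E", r"\Qabc\$xyz\E", r"\Qabc\E\$\Qxyz\E"):
    print(p, "->", expand_quoted(p))
```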
Non-printing characters
A second use of backslash provides a way of encoding non-printing
characters in patterns in a visible manner. There is no
restriction on the appearance of non-printing characters, apart
from the binary zero that terminates a pattern, but when a
pattern is being prepared by text editing, it is often easier to
use one of the following escape sequences than the binary
character it represents. In an ASCII or Unicode environment,
these escapes are as follows:
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII character
\e escape (hex 1B)
\f form feed (hex 0C)
\n linefeed (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\0dd character with octal code 0dd
\ddd character with octal code ddd, or back reference
\o{ddd..} character with octal code ddd..
\xhh character with hex code hh
\x{hhh..} character with hex code hhh.. (non-JavaScript mode)
\uhhhh character with hex code hhhh (JavaScript mode only)
The precise effect of \cx on ASCII characters is as follows: if x
is a lower case letter, it is converted to upper case. Then bit 6
of the character (hex 40) is inverted. Thus \cA to \cZ become hex
01 to hex 1A (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is
7B), and \c; becomes hex 7B (; is 3B). If the data item (byte or
16-bit value) following \c has a value greater than 127, a
compile-time error occurs. This locks out non-ASCII characters in
all modes.
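The arithmetic for \cx in an ASCII environment can be checked with
a few lines of Python. The function name control_escape is
hypothetical; the sketch follows only the steps given above (fold
lower case to upper case, then invert hex 40).

```python
def control_escape(x):
    """Return the code generated by \\c followed by character x."""
    code = ord(x)
    if code > 127:
        raise ValueError("only ASCII characters may follow \\c")
    if "a" <= x <= "z":
        code = ord(x.upper())    # lower case letters are folded first
    return code ^ 0x40           # invert bit 6 (hex 40)

for ch in "AZz{;":
    print(ch, hex(control_escape(ch)))
```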
When PCRE is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
generate the appropriate EBCDIC code values. The \c escape is
processed as specified for Perl in the perlebcdic document. The
only characters that are allowed after \c are A-Z, a-z, or one of
@, [, \, ], ^, _, or ?. Any other character provokes a compile-
time error. The sequence \c@ encodes character code 0; after \c
the letters (in either case) encode characters 1-26 (hex 01 to
hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex
1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
Thus, apart from \c?, these escapes generate the same character
code values as they do in an ASCII environment, though the
meanings of the values mostly differ. For example, \cG always
generates code value 7, which is BEL in ASCII but DEL in EBCDIC.
The sequence \c? generates DEL (127, hex 7F) in an ASCII
environment, but because 127 is not a control character in
EBCDIC, Perl makes it generate the APC character. Unfortunately,
there are several variants of EBCDIC. In most of them the APC
character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters
have POSIX-BC values, PCRE makes \c? generate 95; otherwise it
generates 255.
After \0 up to two further octal digits are read. If there are
fewer than two digits, just those that are present are used. Thus
the sequence \0\x\015 specifies two binary zeros followed by a CR
character (code value 13). Make sure you supply two digits after
the initial zero if the pattern character that follows is itself
an octal digit.
The escape \o must be followed by a sequence of octal digits,
enclosed in braces. An error occurs if this is not the case. This
escape is a recent addition to Perl; it provides a way of
specifying character code points as octal numbers greater than
0777, and it also allows octal numbers and back references to be
unambiguously specified.
For greater clarity and unambiguity, it is best to avoid
following \ by a digit greater than zero. Instead, use \o{} or
\x{} to specify character numbers, and \g{} to specify back
references. The following paragraphs describe the old, ambiguous
syntax.
The handling of a backslash followed by a digit other than 0 is
complicated, and Perl has changed in recent releases, causing
PCRE also to change. Outside a character class, PCRE reads the
digit and any following digits as a decimal number. If the number
is less than 8, or if there have been at least that many previous
capturing left parentheses in the expression, the entire sequence
is taken as a back reference. A description of how this works is
given later, following the discussion of parenthesized
subpatterns.
Inside a character class, or if the decimal number following \ is
greater than 7 and there have not been that many capturing
subpatterns, PCRE handles \8 and \9 as the literal characters "8"
and "9", and otherwise re-reads up to three octal digits
following the backslash, using them to generate a data character.
Any subsequent digits stand for themselves. For example:
\040   is another way of writing an ASCII space
\40    is the same, provided there are fewer than 40
       previous capturing subpatterns
\7     is always a back reference
\11    might be a back reference, or another way of
       writing a tab
\011   is always a tab
\0113  is a tab followed by the character "3"
\113   might be a back reference, otherwise the
       character with octal code 113
\377   might be a back reference, otherwise
       the value 255 (decimal)
\81    is either a back reference, or the two
       characters "8" and "1"
Note that octal values of 100 or greater that are specified using
this syntax must not be introduced by a leading zero, because no
more than three octal digits are ever read.
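The decision procedure for a backslash followed by digits can be
written out as a short function. This is a model of the rules
stated above, not PCRE's source code. The name
classify_digit_escape and the tuple it returns are hypothetical:
(kind, value, number of digits consumed).

```python
DIGITS = "0123456789"
OCTAL = "01234567"

def classify_digit_escape(text, group_count, in_class=False):
    """Classify the digits in text, which follow a backslash.

    group_count is the number of capturing left parentheses seen
    so far in the pattern.
    """
    if not text or text[0] not in DIGITS:
        raise ValueError("text must start with a digit")
    if text[0] != "0" and not in_class:
        end = 0
        while end < len(text) and text[end] in DIGITS:
            end += 1
        number = int(text[:end])         # read as a decimal number
        if number < 8 or number <= group_count:
            return ("backref", number, end)
    if text[0] in "89":
        return ("literal", ord(text[0]), 1)
    end = 1                              # re-read up to three octal digits
    while end < 3 and end < len(text) and text[end] in OCTAL:
        end += 1
    return ("octal", int(text[:end], 8), end)

for digits in ("040", "40", "7", "11", "0113", "113", "377", "81"):
    print(digits, classify_digit_escape(digits, group_count=0))
```

Digits left over after the consumed ones stand for themselves, as
in the \0113 example.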
By default, after \x that is not followed by {, from zero to two
hexadecimal digits are read (letters can be in upper or lower
case). Any number of hexadecimal digits may appear between \x{
and }. If a character other than a hexadecimal digit appears
between \x{ and }, or if there is no terminating }, an error
occurs.
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation
of \x is as just described only when it is followed by two
hexadecimal digits. Otherwise, it matches a literal "x"
character. In JavaScript mode, support for code points greater
than 256 is provided by \u, which must be followed by four
hexadecimal digits; otherwise it matches a literal "u" character.
Characters whose value is less than 256 can be defined by either
of the two syntaxes for \x (or by \u in JavaScript mode). There
is no difference in the way they are handled. For example, \xdc
is exactly the same as \x{dc} (or \u00dc in JavaScript mode).
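The two readings of \x can be compared side by side in a sketch
such as the following. It models only the rules in the paragraphs
above. The name parse_hex_escape is hypothetical, and treating an
empty \x{} as an error is an assumption of this sketch.

```python
HEX = "0123456789abcdefABCDEF"

def parse_hex_escape(rest, javascript=False):
    """Interpret the text that follows \\x.

    Returns (character code, number of characters consumed).
    """
    if javascript:
        # Only exactly two hex digits make an escape in this mode.
        if len(rest) >= 2 and rest[0] in HEX and rest[1] in HEX:
            return int(rest[:2], 16), 2
        return ord("x"), 0               # a literal "x"
    if rest.startswith("{"):
        close = rest.find("}")
        digits = rest[1:close] if close != -1 else ""
        if not digits or any(d not in HEX for d in digits):
            raise ValueError("malformed \\x{...} sequence")
        return int(digits, 16), close + 1
    end = 0
    while end < 2 and end < len(rest) and rest[end] in HEX:
        end += 1                         # zero to two hex digits
    return (int(rest[:end], 16) if end else 0), end

print(parse_hex_escape("dc"), parse_hex_escape("{dc}"))
```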
Constraints on character values
Characters that are specified using octal or hexadecimal numbers
are limited to certain values, as follows:
8-bit non-UTF mode    less than 0x100
8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
16-bit non-UTF mode   less than 0x10000
16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
32-bit non-UTF mode   less than 0x100000000
32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the
so-called "surrogate" codepoints), and 0xffef.
Escape sequences in character classes
All the sequences that define a single character value can be
used both inside and outside character classes. In addition,
inside a character class, \b is interpreted as the backspace
character (hex 08).
\N is not allowed in a character class. \B, \R, and \X are not
special inside a character class. Like other unrecognized escape
sequences, they are treated as the literal characters "B", "R",
and "X" by default, but cause an error if the PCRE_EXTRA option
is set. Outside a character class, these sequences have different
meanings.
Unsupported escape sequences
In Perl, the sequences \l, \L, \u, and \U are recognized by its
string handler and used to modify the case of following
characters. By default, PCRE does not support these escape
sequences. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
\U matches a "U" character, and \u can be used to define a
character by code point, as described in the previous section.
Absolute and relative back references
The sequence \g followed by an unsigned or a negative number,
optionally enclosed in braces, is an absolute or relative back
reference. A named back reference can be coded as \g{name}. Back
references are discussed later, following the discussion of
parenthesized subpatterns.
Absolute and relative subroutine calls
For compatibility with Oniguruma, the non-Perl syntax \g followed
by a name or a number enclosed either in angle brackets or single
quotes, is an alternative syntax for referencing a subpattern as
a "subroutine". Details are discussed later. Note that \g{...}
(Perl syntax) and \g<...> (Oniguruma syntax) are not synonymous.
The former is a back reference; the latter is a subroutine call.
Generic character types
Another use of backslash is for specifying generic character
types:
\d     any decimal digit
\D     any character that is not a decimal digit
\h     any horizontal white space character
\H     any character that is not a horizontal white space
       character
\s     any white space character
\S     any character that is not a white space character
\v     any vertical white space character
\V     any character that is not a vertical white space
       character
\w     any "word" character
\W     any "non-word" character
There is also the single sequence \N, which matches a non-newline
character. This is the same as the "." metacharacter when
PCRE_DOTALL is not set. Perl also uses \N to match characters by
name; PCRE does not support this.
Each pair of lower and upper case escape sequences partitions the
complete set of characters into two disjoint sets. Any given
character matches one, and only one, of each pair. The sequences
can appear both inside and outside character classes. They each
match one character of the appropriate type. If the current
matching point is at the end of the subject string, all of them
fail, because there is no character to match.
For compatibility with Perl, \s did not formerly match the VT
character (code 11), which made it different from the POSIX
"space" class. However, Perl added VT at release 5.18, and PCRE
followed suit at release 8.34. The default \s characters are now
HT (9), LF (10), VT (11), FF (12), CR (13), and space (32), which
are defined as white space in the "C" locale. This list may vary
if locale-specific matching is taking place. For example, in some
locales the "non-breaking space" character (\xA0) is recognized
as white space, and in others the VT character is not.
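The default sets can be enumerated with a stand-in engine. The
sketch below uses Python's re module with the re.ASCII flag, whose
ASCII-only tables are assumed here to agree with the "C" locale
defaults listed above. It is not PCRE, and the helper name
matching_codes is hypothetical.

```python
import re

def matching_codes(escape):
    """Code points 0-255 matched by a one-character class escape."""
    rx = re.compile(escape, re.ASCII)
    return [c for c in range(256) if rx.fullmatch(chr(c))]

print(matching_codes(r"\s"))

# Each lower/upper case pair partitions the characters: every code
# point is matched by exactly one member of the pair.
both = sorted(matching_codes(r"\d") + matching_codes(r"\D"))
print(both == list(range(256)))
```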
A "word" character is an underscore or any character that is a
letter or digit. By default, the definition of letters and
digits is controlled by PCRE's low-valued character tables, and
may vary if locale-specific matching is taking place (see "Locale
support" in the pcreapi page). For example, in a French locale
such as "fr_FR" in Unix-like systems, or "french" in Windows,
some character codes greater than 127 are used for accented
letters, and these are then matched by \w. The use of locales
with Unicode is discouraged.
By default, characters whose code points are greater than 127
never match \d, \s, or \w, and always match \D, \S, and \W,
although this may vary for characters in the range 128-255 when
locale-specific matching is happening. These escape sequences
retain their original meanings from before Unicode support was
available, mainly for efficiency reasons. If PCRE is compiled
with Unicode property support, and the PCRE_UCP option is set,
the behaviour is changed so that Unicode properties are used to
determine character types, as follows:
\d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v
\w any character that matches \p{L} or \p{N}, plus underscore
The upper case escapes match the inverse sets of characters. Note
that \d matches only decimal digits, whereas \w matches any
Unicode digit, as well as any Unicode letter, and underscore.
Note also that PCRE_UCP affects \b and \B, because they are
defined in terms of \w and \W. Matching these sequences is
noticeably slower when PCRE_UCP is set.
The sequences \h, \H, \v, and \V are features that were added to
Perl at release 5.10. In contrast to the other sequences, which
match only ASCII characters by default, these always match
certain high-valued code points, whether or not PCRE_UCP is set.
The horizontal space characters are:
U+0009 Horizontal tab (HT)
U+0020 Space
U+00A0 Non-break space
U+1680 Ogham space mark
U+180E Mongolian vowel separator
U+2000 En quad
U+2001 Em quad
U+2002 En space
U+2003 Em space
U+2004 Three-per-em space
U+2005 Four-per-em space
U+2006 Six-per-em space
U+2007 Figure space
U+2008 Punctuation space
U+2009 Thin space
U+200A Hair space
U+202F Narrow no-break space
U+205F Medium mathematical space
U+3000 Ideographic space
The vertical space characters are:
U+000A Linefeed (LF)
U+000B Vertical tab (VT)
U+000C Form feed (FF)
U+000D Carriage return (CR)
U+0085 Next line (NEL)
U+2028 Line separator
U+2029 Paragraph separator
In 8-bit, non-UTF-8 mode, only the characters with codepoints
less than 256 are relevant.
Newline sequences
Outside a character class, by default, the escape sequence \R
matches any Unicode newline sequence. In 8-bit non-UTF-8 mode \R
is equivalent to the following:
(?>\r\n|\n|\x0b|\f|\r|\x85)
This is an example of an "atomic group", details of which are
given below. This particular group matches either the two-
character sequence CR followed by LF, or one of the single
characters LF (linefeed, U+000A), VT (vertical tab, U+000B), FF
(form feed, U+000C), CR (carriage return, U+000D), or NEL (next
line, U+0085). The two-character sequence is treated as a single
unit that cannot be split.
In other modes, two additional characters whose codepoints are
greater than 255 are added: LS (line separator, U+2028) and PS
(paragraph separator, U+2029). Unicode character property
support is not needed for these characters to be recognized.
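For engines without \R, the equivalent alternation can be spelled
out by hand. The Python sketch below is an approximation: it uses
an ordinary non-capturing group rather than an atomic group, so
the guarantee that CR LF is never split is not reproduced in
patterns that backtrack. The names NEWLINE_8BIT, NEWLINE_WIDE and
split_lines are hypothetical.

```python
import re

# CRLF is listed first so that it is tried before a lone CR.
NEWLINE_8BIT = r"(?:\r\n|\n|\x0b|\f|\r|\x85)"
NEWLINE_WIDE = r"(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)"

def split_lines(text, wide=True):
    """Split text at each newline sequence from the chosen set."""
    return re.split(NEWLINE_WIDE if wide else NEWLINE_8BIT, text)

print(split_lines("a\r\nb\rc\u2028d"))
```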
It is possible to restrict \R to match only CR, LF, or CRLF
(instead of the complete set of Unicode line endings) by setting
the option PCRE_BSR_ANYCRLF either at compile time or when the
pattern is matched. (BSR is an abbreviation for "backslash R".)
This can be made the default when PCRE is built; if this is the
case, the other behaviour can be requested via the
PCRE_BSR_UNICODE option. It is also possible to specify these
settings by starting a pattern string with one of the following
sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
These override the default and the options given to the compiling
function, but they can themselves be overridden by options given
to a matching function. Note that these special settings, which
are not Perl-compatible, are recognized only at the very start of
a pattern, and that they must be in upper case. If more than one
of them is present, the last one is used. They can be combined
with a change of newline convention; for example, a pattern can
start with:
(*ANY)(*BSR_ANYCRLF)
They can also be combined with the (*UTF8), (*UTF16), (*UTF32),
(*UTF) or (*UCP) special sequences. Inside a character class, \R
is treated as an unrecognized escape sequence, and so matches the
letter "R" by default, but causes an error if PCRE_EXTRA is set.
Unicode character properties
When PCRE is built with Unicode character property support, three
additional escape sequences that match characters with specific
properties are available. When in 8-bit non-UTF-8 mode, these
sequences are of course limited to testing characters whose
codepoints are less than 256, but they do work in this mode. The
extra escape sequences are:
\p{xx} a character with the xx property
\P{xx} a character without the xx property
\X a Unicode extended grapheme cluster
The property names represented by xx above are limited to the
Unicode script names, the general category properties, "Any",
which matches any character (including newline), and some special
PCRE properties (described in the next section). Other Perl
properties such as "InMusicalSymbols" are not currently supported
by PCRE. Note that \P{Any} does not match any characters, so
always causes a match failure.
Sets of Unicode characters are defined as belonging to certain
scripts. A character from one of these sets can be matched using
a script name. For example:
\p{Greek}
\P{Han}
Those that are not part of an identified script are lumped
together as "Common". The current list of scripts is:
Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak,
Bengali, Bopomofo, Brahmi, Braille, Buginese, Buhid,
Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham,
Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic,
Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi,
Han, Hangul, Hanunoo, Hebrew, Hiragana, Imperial_Aramaic,
Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer,
Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Linear_A, Linear_B,
Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic, Manichaean,
Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Myanmar,
Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic,
Old_North_Arabian, Old_Permic, Old_Persian, Old_South_Arabian,
Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene, Pau_Cin_Hau,
Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic, Samaritan,
Saurashtra, Sharada, Shavian, Siddham, Sinhala, Sora_Sompeng,
Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le,
Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai, Tibetan,
Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.
Each character has exactly one Unicode general category property,
specified by a two-letter abbreviation. For compatibility with
Perl, negation can be specified by including a circumflex between
the opening brace and the property name. For example, \p{^Lu} is
the same as \P{Lu}.
If only one letter is specified with \p or \P, it includes all
the general category properties that start with that letter. In
this case, in the absence of negation, the curly brackets in the
escape sequence are optional; these two examples have the same
effect:
\p{L}
\pL
The following general category property codes are supported:
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator
The special property L& is also supported: it matches a character
that has the Lu, Ll, or Lt property, in other words, a letter
that is not classified as a modifier or "other".
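The general category tests can be imitated with Python's
unicodedata module, as in the sketch below. It covers only the
general category codes, "Any", L& and the circumflex negation;
script names are not handled. The name has_property is
hypothetical, and the Unicode version of Python's tables may
differ from the one PCRE was built with.

```python
import unicodedata

def has_property(ch, prop):
    """Model of \\p{prop} for general category properties."""
    negate = prop.startswith("^")
    name = prop[1:] if negate else prop
    category = unicodedata.category(ch)   # for example "Lu" or "Nd"
    if name == "Any":
        result = True
    elif name == "L&":
        result = category in ("Lu", "Ll", "Lt")
    else:
        # A single letter matches every category starting with it.
        result = category.startswith(name)
    return result != negate

print(has_property("A", "Lu"), has_property("a", "^Lu"))
```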
The Cs (Surrogate) property applies only to characters in the
range U+D800 to U+DFFF. Such characters are not valid in Unicode
strings and so cannot be tested by PCRE, unless UTF validity
checking has been turned off (see the discussion of
PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK
in the pcreapi page). Perl does not support the Cs property.
The long synonyms for property names that Perl supports (such as
\p{Letter}) are not supported by PCRE, nor is it permitted to
prefix any of these properties with "Is".
No character that is in the Unicode table has the Cn (unassigned)
property. Instead, this property is assumed for any code point
that is not in the Unicode table.
Specifying caseless matching does not affect these escape
sequences. For example, \p{Lu} always matches only upper case
letters. This is different from the behaviour of current versions
of Perl.
Matching characters by Unicode property is not fast, because PCRE
has to do a multistage table lookup in order to find a
character's property. That is why the traditional escape
sequences such as \d and \w do not use Unicode properties in PCRE
by default, though you can make them do so by setting the
PCRE_UCP option or by starting the pattern with (*UCP).
Extended grapheme clusters
The \X escape matches any number of Unicode characters that form
an "extended grapheme cluster", and treats the sequence as an
atomic group (see below). Up to and including release 8.31, PCRE
matched an earlier, simpler definition that was equivalent to
(?>\PM\pM*)
That is, it matched a character without the "mark" property,
followed by zero or more characters with the "mark" property.
Characters with the "mark" property are typically non-spacing
accents that affect the preceding character.
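The earlier definition is simple enough to imitate directly. The
sketch below groups each non-mark character with the marks that
follow it, using Python's unicodedata module to test for the
"mark" property. The name simple_clusters is hypothetical. Keeping
a leading run of marks as a cluster of its own is an assumption of
this sketch; the pattern above would not match there at all.

```python
import unicodedata

def simple_clusters(text):
    """Split text into a base character plus its following marks."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch       # attach the mark to its base
        else:
            clusters.append(ch)      # start a new cluster
    return clusters

# "e" followed by U+0301 (combining acute accent), then "a"
print(len(simple_clusters("e\u0301a")))
```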
This simple definition was extended in Unicode to include more
complicated kinds of composite character by giving each character
a grapheme breaking property, and creating rules that use these
properties to define the boundaries of extended grapheme
clusters. In releases of PCRE later than 8.31, \X matches one of
these clusters.
\X always matches at least one character. Then it decides whether
to add additional characters according to the following rules for
ending a cluster:
1. End at the end of the subject string.
2. Do not end between CR and LF; otherwise end after any control
character.
3. Do not break Hangul (a Korean script) syllable sequences.
Hangul characters are of five types: L, V, T, LV, and LVT. An L
character may be followed by an L, V, LV, or LVT character; an LV
or V character may be followed by a V or T character; an LVT or T
character may be followed only by a T character.
4. Do not end before extending characters or spacing marks.
Characters with the "mark" property always have the "extend"
grapheme breaking property.
5. Do not end after prepend characters.
6. Otherwise, end the cluster.
PCRE's additional properties
As well as the standard Unicode properties described above, PCRE
supports four more that make it possible to convert traditional
escape sequences such as \w and \s to use Unicode properties.
PCRE uses these non-standard, non-Perl properties internally when
PCRE_UCP is set. However, they may also be used explicitly. These
properties are:
Xan Any alphanumeric character
Xps Any POSIX space character
Xsp Any Perl space character
Xwd Any Perl "word" character
Xan matches characters that have either the L (letter) or the N
(number) property. Xps matches the characters tab, linefeed,
vertical tab, form feed, or carriage return, and any other
character that has the Z (separator) property. Xsp is the same
as Xps; it used to exclude vertical tab, for Perl compatibility,
but Perl changed, and so PCRE followed at release 8.34. Xwd
matches the same characters as Xan, plus underscore.
There is another non-standard property, Xuc, which matches any
character that can be represented by a Universal Character Name
in C++ and other programming languages. These are the characters
$, @, ` (grave accent), and all characters with Unicode code
points greater than or equal to U+00A0, except for the surrogates
U+D800 to U+DFFF. Note that most base (ASCII) characters are
excluded. (Universal Character Names are of the form \uHHHH or
\UHHHHHHHH where H is a hexadecimal digit. Note that the Xuc
property does not match these sequences but the characters that
they represent.)
Resetting the match start
The escape sequence \K causes any previously matched characters
not to be included in the final matched sequence. For example,
the pattern:
foo\Kbar
matches "foobar", but reports that it has matched "bar". This
feature is similar to a lookbehind assertion (described below).
However, in this case, the part of the subject before the real
match does not have to be of fixed length, as lookbehind
assertions do. The use of \K does not interfere with the setting
of captured substrings. For example, when the pattern
(foo)\Kbar
matches "foobar", the first substring is still set to "foo".
Perl documents that the use of \K within assertions is "not well
defined". In PCRE, \K is acted upon when it occurs inside
positive assertions, but is ignored in negative assertions. Note
that when a pattern such as (?=ab\K) matches, the reported start
of the match can be greater than the end of the match.
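Python's re module has no \K, but the effect on the reported match
can be approximated with a named group. The sketch is an analogy
only; the names search_keep_tail and "tail" are hypothetical. The
leading part may be of variable length, which a lookbehind
assertion would not allow.

```python
import re

def search_keep_tail(prefix, tail, subject):
    """Find prefix followed by tail; report only the span of tail."""
    m = re.search("(?:" + prefix + ")(?P<tail>" + tail + ")", subject)
    return None if m is None else m.span("tail")

print(search_keep_tail("foo", "bar", "foobar"))
print(search_keep_tail("fo+", "bar", "xfooobar"))
```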
Simple assertions
The final use of backslash is for certain simple assertions. An
assertion specifies a condition that has to be met at a
particular point in a match, without consuming any characters
from the subject string. The use of subpatterns for more
complicated assertions is described below. The backslashed
assertions are:
\b     matches at a word boundary
\B     matches when not at a word boundary
\A     matches at the start of the subject
\Z     matches at the end of the subject;
       also matches before a newline at the end of the subject
\z     matches only at the end of the subject
\G     matches at the first matching position in the subject
Inside a character class, \b has a different meaning; it matches
the backspace character. If any other of these assertions appears
in a character class, by default it matches the corresponding
literal character (for example, \B matches the letter B).
However, if the PCRE_EXTRA option is set, an "invalid escape
sequence" error is generated instead.
A word boundary is a position in the subject string where the
current character and the previous character do not both match \w
or \W (i.e. one matches \w and the other matches \W), or the
start or end of the string if the first or last character matches
\w, respectively. In a UTF mode, the meanings of \w and \W can be
changed by setting the PCRE_UCP option. When this is done, it
also affects \b and \B. Neither PCRE nor Perl has a separate
"start of word" or "end of word" metasequence. However, whatever
follows \b normally determines which it is. For example, the
fragment \ba matches "a" at the start of a word.
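The definition of a word boundary translates into a comparison of
the characters on either side of a position. The sketch below uses
ASCII-only \w from Python's re module as a stand-in for the
default tables. The helper names is_word and is_word_boundary are
hypothetical.

```python
import re

WORD = re.compile(r"\w", re.ASCII)

def is_word(subject, index):
    """True if index is inside subject and holds a word character."""
    return (0 <= index < len(subject)
            and WORD.fullmatch(subject[index]) is not None)

def is_word_boundary(subject, pos):
    """True if exactly one neighbour of pos is a word character.

    Positions outside the string count as non-word, which gives the
    start-of-string and end-of-string cases.
    """
    return is_word(subject, pos - 1) != is_word(subject, pos)

s = "ab cd"
print([p for p in range(len(s) + 1) if is_word_boundary(s, p)])
```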
The \A, \Z, and \z assertions differ from the traditional
circumflex and dollar (described in the next section) in that
they only ever match at the very start and end of the subject
string, whatever options are set. Thus, they are independent of
multiline mode. These three assertions are not affected by the
PCRE_NOTBOL or PCRE_NOTEOL options, which affect only the
behaviour of the circumflex and dollar metacharacters. However,
if the startoffset argument of pcre_exec() is non-zero,
indicating that matching is to start at a point other than the
beginning of the subject, \A can never match. The difference
between \Z and \z is that \Z matches before a newline at the end
of the string as well as at the very end, whereas \z matches only
at the end.
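The three assertions can be stated as predicates on a position in
the subject. The sketch assumes that LF is the newline convention
in force, and the names at_A, at_Z and at_z are hypothetical. The
predicates are written out by hand because Python's re module
spells the strict end-of-subject test \Z.

```python
def at_A(subject, pos):
    """\\A: only at the very start of the subject."""
    return pos == 0

def at_z(subject, pos):
    """\\z: only at the very end of the subject."""
    return pos == len(subject)

def at_Z(subject, pos):
    """\\Z: at the very end, or just before a final newline."""
    return pos == len(subject) or (
        pos == len(subject) - 1 and subject.endswith("\n"))

text = "abc\n"
print([p for p in range(len(text) + 1) if at_Z(text, p)])
print([p for p in range(len(text) + 1) if at_z(text, p)])
```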
The \G assertion is true only when the current matching position
is at the start point of the match, as specified by the
startoffset argument of pcre_exec(). It differs from \A when the
value of startoffset is non-zero. By calling pcre_exec() multiple
times with appropriate arguments, you can mimic Perl's /g option,
and it is in this kind of implementation where \G can be useful.
Note, however, that PCRE's interpretation of \G, as the start of
the current match, is subtly different from Perl's, which defines
it as the end of the previous match. In Perl, these can be
different when the previously matched string was empty. Because
PCRE does just one match at a time, it cannot reproduce this
behaviour.
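The repeated-call technique can be pictured with Python's re
module, where Pattern.match(string, pos) anchors the attempt at
pos, much as a pattern that starts with \G is anchored at
startoffset. This is an analogy, not the PCRE API. The name
match_all_anchored is hypothetical, and stopping at an empty match
is a simplification of this sketch.

```python
import re

def match_all_anchored(pattern, subject, start=0):
    """Collect successive matches, each starting where the last ended."""
    rx = re.compile(pattern)
    pos, found = start, []
    while pos <= len(subject):
        m = rx.match(subject, pos)       # anchored at pos
        if m is None or m.end() == pos:
            break                        # no match, or an empty match
        found.append(m.group())
        pos = m.end()                    # next start offset
    return found

print(match_all_anchored(r"\d+,?", "12,34,x56"))
```

The scan stops at the first position where the pattern fails, so
the digits after "x" are never reached.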
If all the alternatives of a pattern begin with \G, the
expression is anchored to the starting match position, and the
"anchored" flag is set in the compiled regular expression.