Perl-compatible regular expressions
SPECIAL START-OF-PATTERN ITEMS
A number of options that can be passed to pcre_compile() can also
be set by special items at the start of a pattern. These are not
Perl-compatible, but are provided to make these options
accessible to pattern writers who are not able to change the
program that processes the pattern. Any number of these items may
appear, but they must all be together right at the start of the
pattern string, and the letters must be in upper case.
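The placement rule above can be illustrated with a minimal sketch. It is
not how PCRE itself parses patterns: split_leading_items() is a
hypothetical helper, and the list of item names is limited to those
mentioned in this section, so a real PCRE build may accept others. The
sketch only shows that items are recognized when they sit together at the
very start of the string and are written in upper case.

```python
import re

# Item names taken from this section only (an assumption of this sketch).
KNOWN_ITEMS = {
    "UTF8", "UTF16", "UTF32", "UTF", "UCP",
    "NO_AUTO_POSSESS", "NO_START_OPT",
    "CR", "LF", "CRLF", "ANYCRLF", "ANY",
}
LIMIT_ITEMS = {"LIMIT_MATCH", "LIMIT_RECURSION"}

# One candidate item: (*NAME) or (*NAME=digits), upper-case letters only.
_ITEM = re.compile(r"\(\*([A-Z0-9_]+)(?:=(\d+))?\)")


def split_leading_items(pattern):
    """Return (items, rest): the leading special items and the remaining pattern."""
    items = []
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None:
            break
        name, value = m.group(1), m.group(2)
        if value is None and name in KNOWN_ITEMS:
            items.append((name, None))
        elif value is not None and name in LIMIT_ITEMS:
            items.append((name, int(value)))
        else:
            break  # not a start-of-pattern item: it belongs to the pattern proper
        pos = m.end()
    return items, pattern[pos:]
```

With this helper, "(*UTF8)(*UCP)a.b" splits into two items and the body
"a.b", whereas "(*utf8)abc" and "a(*UCP)b" yield no items at all, because
the first is in lower case and the second is not at the start.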
UTF support
The original operation of PCRE was on strings of one-byte
characters. However, there is now also support for UTF-8 strings
in the original library, an extra library that supports 16-bit
and UTF-16 character strings, and a third library that supports
32-bit and UTF-32 character strings. To use these features, PCRE
must be built to include appropriate support. When using UTF
strings you must either call the compiling function with the
PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option, or the pattern must
start with one of these special sequences:
(*UTF8)
(*UTF16)
(*UTF32)
(*UTF)
(*UTF) is a generic sequence that can be used with any of the
libraries. Starting a pattern with such a sequence is equivalent
to setting the relevant option. How setting a UTF mode affects
pattern matching is mentioned in several places below. There is
also a summary of features in the pcreunicode page.
Some applications that allow their users to supply patterns may
wish to restrict them to non-UTF data for security reasons. If
the PCRE_NEVER_UTF option is set at compile time, (*UTF) etc. are
not allowed, and their appearance causes an error.
Unicode property support
Another special sequence that may appear at the start of a
pattern is (*UCP). This has the same effect as setting the
PCRE_UCP option: it causes sequences such as \d and \w to use
Unicode properties to determine character types, instead of
recognizing only characters with codes less than 128 via a lookup
table.
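As a rough analogy only, the difference can be observed with Python's
standard re module, which is not PCRE: its re.ASCII flag restricts \d and
\w to ASCII characters, while the default for str patterns uses Unicode
character properties. The helper names below are hypothetical, and the
correspondence with PCRE_UCP is an assumption made for illustration.

```python
import re


def is_digit(ch, unicode_properties):
    """Test one character against \\d, with or without Unicode properties."""
    flags = 0 if unicode_properties else re.ASCII
    return re.fullmatch(r"\d", ch, flags) is not None


def is_word_char(ch, unicode_properties):
    """Test one character against \\w, with or without Unicode properties."""
    flags = 0 if unicode_properties else re.ASCII
    return re.fullmatch(r"\w", ch, flags) is not None
```

Here is_digit("\u0663", True) accepts ARABIC-INDIC DIGIT THREE, while
is_digit("\u0663", False) rejects it. The plain digit "7" is accepted in
both modes.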
Disabling auto-possessification
If a pattern starts with (*NO_AUTO_POSSESS), it has the same
effect as setting the PCRE_NO_AUTO_POSSESS option at compile
time. This stops PCRE from making quantifiers possessive when
what follows cannot match the repeated item. For example, by
default a+b is treated as a++b. For more details, see the pcreapi
documentation.
Disabling start-up optimizations
If a pattern starts with (*NO_START_OPT), it has the same effect
as setting the PCRE_NO_START_OPTIMIZE option either at compile or
matching time. This disables several optimizations for quickly
reaching "no match" results. For more details, see the pcreapi
documentation.
Newline conventions
PCRE supports five different conventions for indicating line
breaks in strings: a single CR (carriage return) character, a
single LF (linefeed) character, the two-character sequence CRLF,
any of the three preceding, or any Unicode newline sequence. The
pcreapi page has further discussion about newlines, and shows how
to set the newline convention in the options arguments for the
compiling and matching functions.
It is also possible to specify a newline convention by starting a
pattern string with one of the following five sequences:
(*CR)        carriage return
(*LF)        linefeed
(*CRLF)      carriage return, followed by linefeed
(*ANYCRLF)   any of the three above
(*ANY)       all Unicode newline sequences
These override the default and the options given to the compiling
function. For example, on a Unix system where LF is the default
newline sequence, the pattern
(*CR)a.b
changes the convention to CR. That pattern matches "a\nb" because
LF is no longer a newline. If more than one of these settings is
present, the last one is used.
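The (*CR)a.b example and the "last one is used" rule can be emulated in a
small sketch. Python's re module does not understand these items, so the
dot is replaced by a negated character class that excludes the newline
character of the chosen convention. The function names are hypothetical,
and only the two single-character conventions (CR and LF) are modelled.

```python
import re

NEWLINE_ITEMS = ("CR", "LF", "CRLF", "ANYCRLF", "ANY")

# Negated classes standing in for "." without DOTALL. Only the two
# single-character conventions are covered; the others raise KeyError below.
DOT_CLASS = {"CR": r"[^\r]", "LF": r"[^\n]"}


def effective_convention(items, default):
    """Return the last newline item in `items`, or `default` if there is none."""
    chosen = default
    for name in items:
        if name in NEWLINE_ITEMS:
            chosen = name
    return chosen


def a_dot_b_matches(items, subject, default="LF"):
    """Emulate matching the body a.b against the whole of `subject`."""
    convention = effective_convention(items, default)
    return re.fullmatch("a" + DOT_CLASS[convention] + "b", subject) is not None
```

With an LF default, a_dot_b_matches([], "a\nb") is false, whereas
a_dot_b_matches(["CR"], "a\nb") is true, because LF is no longer treated
as a newline.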
The newline convention affects where the circumflex and dollar
assertions are true. It also affects the interpretation of the
dot metacharacter when PCRE_DOTALL is not set, and the behaviour
of \N. However, it does not affect what the \R escape sequence
matches. By default, this is any Unicode newline sequence, for
Perl compatibility. However, this can be changed; see the
description of \R in the section entitled "Newline sequences"
below. A change of \R setting can be combined with a change of
newline convention.
Setting match and recursion limits
The caller of pcre_exec() can set a limit on the number of times
the internal match() function is called and on the maximum depth
of recursive calls. These facilities are provided to catch
runaway matches that are provoked by patterns with huge matching
trees (a typical example is a pattern with nested unlimited
repeats) and to avoid running out of system stack by too much
recursion. When one of these limits is reached, pcre_exec() gives
an error return. The limits can also be set by items at the start
of the pattern of the form
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
where d is any number of decimal digits. However, the value of
the setting must be less than the value set (or defaulted) by the
caller of pcre_exec() for it to have any effect. In other words,
the pattern writer can lower the limits set by the programmer,
but not raise them. If there is more than one setting of one of
these limits, the lower value is used.
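The way these values combine can be written as a short sketch.
effective_limit() is a hypothetical helper, not part of the PCRE API. It
assumes the caller's limit and the values of any (*LIMIT_MATCH=d) or
(*LIMIT_RECURSION=d) items have already been extracted as integers.

```python
def effective_limit(caller_limit, pattern_settings):
    """Combine the caller's limit with the limit values found in a pattern.

    A pattern setting only takes effect if it is lower than the current
    limit, so the result can never exceed what the caller allowed.
    """
    limit = caller_limit
    for value in pattern_settings:
        if value < limit:
            limit = value
    return limit
```

For a caller limit of 10000, a pattern setting of 500 lowers the limit to
500 and a setting of 50000 is ignored. The settings 500, 200 and 800
together give 200.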