Perl-compatible regular expressions
COMPILING A PATTERN
pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);
pcre *pcre_compile2(const char *pattern, int options,
     int *errorcodeptr,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);
Either of the functions pcre_compile() or pcre_compile2() can be
called to compile a pattern into an internal form. The only
difference between the two interfaces is that pcre_compile2() has
an additional argument, errorcodeptr, via which a numerical error
code can be returned. To avoid too much repetition, we refer just
to pcre_compile() below, but the information applies equally to
pcre_compile2().
The pattern is a C string terminated by a binary zero, and is
passed in the pattern argument. A pointer to a single block of
memory that is obtained via pcre_malloc is returned. This
contains the compiled code and related data. The pcre type is
defined for the returned block; this is a typedef for a structure
whose contents are not externally defined. It is up to the caller
to free the memory (via pcre_free) when it is no longer required.
Although the compiled code of a PCRE regex is relocatable, that
is, it does not depend on memory location, the complete pcre data
block is not fully relocatable, because it may contain a copy of
the tableptr argument, which is an address (see below).
The options argument contains various bit settings that affect
the compilation. It should be zero if no options are required.
The available options are described below. Some of them (in
particular, those that are compatible with Perl, but some others
as well) can also be set and unset from within the pattern (see
the detailed description in the pcrepattern documentation). For
those options that can be different in different parts of the
pattern, the contents of the options argument specify their
settings at the start of compilation and execution. The
PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx,
PCRE_NO_UTF8_CHECK, and PCRE_NO_START_OPTIMIZE options can be set
at the time of matching as well as at compile time.
If errptr is NULL, pcre_compile() returns NULL immediately.
Otherwise, if compilation of a pattern fails, pcre_compile()
returns NULL, and sets the variable pointed to by errptr to point
to a textual error message. This is a static string that is part
of the library. You must not try to free it. Normally, the offset
from the start of the pattern to the data unit that was being
processed when the error was discovered is placed in the variable
pointed to by erroffset, which must not be NULL (if it is, an
immediate error is given). However, for an invalid UTF-8 or
UTF-16 string, the offset is that of the first data unit of the
failing character.
Some errors are not detected until the whole pattern has been
scanned; in these cases, the offset passed back is the length of
the pattern. Note that the offset is in data units, not
characters, even in a UTF mode. It may sometimes point into the
middle of a UTF-8 or UTF-16 character.
If pcre_compile2() is used instead of pcre_compile(), and the
errorcodeptr argument is not NULL, a non-zero error code number
is returned via this argument in the event of an error. This is
in addition to the textual error message. Error codes and
messages are listed below.
If the final argument, tableptr, is NULL, PCRE uses a default set
of character tables that are built when PCRE is compiled, using
the default C locale. Otherwise, tableptr must be an address that
is the result of a call to pcre_maketables(). This value is
stored with the compiled pattern, and used again by pcre_exec()
and pcre_dfa_exec() when the pattern is matched. For more
discussion, see the section on locale support below.
This code fragment shows a typical straightforward call to
pcre_compile():
  pcre *re;
  const char *error;
  int erroffset;
  re = pcre_compile(
    "^A.*Z",          /* the pattern */
    0,                /* default options */
    &error,           /* for error message */
    &erroffset,       /* for error offset */
    NULL);            /* use default character tables */
The following names for option bits are defined in the pcre.h
header file:
PCRE_ANCHORED
If this bit is set, the pattern is forced to be "anchored", that
is, it is constrained to match only at the first matching point
in the string that is being searched (the "subject string"). This
effect can also be achieved by appropriate constructs in the
pattern itself, which is the only way to do it in Perl.
PCRE_AUTO_CALLOUT
If this bit is set, pcre_compile() automatically inserts callout
items, all with number 255, before each pattern item. For
discussion of the callout facility, see the pcrecallout
documentation.
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
These options (which are mutually exclusive) control what the \R
escape sequence matches. The choice is either to match only CR,
LF, or CRLF, or to match any Unicode newline sequence. The
default is specified when PCRE is built. It can be overridden
from within the pattern, or by setting an option when a compiled
pattern is matched.
PCRE_CASELESS
If this bit is set, letters in the pattern match both upper and
lower case letters. It is equivalent to Perl's /i option, and it
can be changed within a pattern by a (?i) option setting. In
UTF-8 mode, PCRE always understands the concept of case for
characters whose values are less than 128, so caseless matching
is always possible. For characters with higher values, the
concept of case is supported if PCRE is compiled with Unicode
property support, but not otherwise. If you want to use caseless
matching for characters 128 and above, you must ensure that PCRE
is compiled with Unicode property support as well as with UTF-8
support.
PCRE_DOLLAR_ENDONLY
If this bit is set, a dollar metacharacter in the pattern matches
only at the end of the subject string. Without this option, a
dollar also matches immediately before a newline at the end of
the string (but not before any other newlines). The
PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
There is no equivalent to this option in Perl, and no way to set
it within a pattern.
PCRE_DOTALL
If this bit is set, a dot metacharacter in the pattern matches a
character of any value, including one that indicates a newline.
However, it only ever matches one character, even if newlines are
coded as CRLF. Without this option, a dot does not match when the
current position is at a newline. This option is equivalent to
Perl's /s option, and it can be changed within a pattern by a
(?s) option setting. A negative class such as [^a] always matches
newline characters, independent of the setting of this option.
PCRE_DUPNAMES
If this bit is set, names used to identify capturing subpatterns
need not be unique. This can be helpful for certain types of
pattern when it is known that only one instance of the named
subpattern can ever be matched. There are more details of named
subpatterns below; see also the pcrepattern documentation.
PCRE_EXTENDED
If this bit is set, most white space characters in the pattern
are totally ignored except when escaped or inside a character
class. However, white space is not allowed within sequences such
as (?> that introduce various parenthesized subpatterns, nor
within a numerical quantifier such as {1,3}. Ignorable white
space is, however, permitted between an item and a following
quantifier and between a quantifier and a following + that
indicates possessiveness.
White space formerly did not include the VT character (code 11),
because Perl did not treat this character as white space.
However, Perl changed at release 5.18, so PCRE followed at
release 8.34, and VT is now treated as white space.
PCRE_EXTENDED also causes characters between an unescaped #
outside a character class and the next newline, inclusive, to be
ignored. PCRE_EXTENDED is equivalent to Perl's /x option, and it
can be changed within a pattern by a (?x) option setting.
Which characters are interpreted as newlines is controlled by the
options passed to pcre_compile() or by a special sequence at the
start of the pattern, as described in the section entitled
"Newline conventions" in the pcrepattern documentation. Note that
the end of this type of comment is a literal newline sequence in
the pattern; escape sequences that happen to represent a newline
do not count.
This option makes it possible to include comments inside
complicated patterns. Note, however, that this applies only to
data characters. White space characters may never appear within
special character sequences in a pattern, for example within the
sequence (?( that introduces a conditional subpattern.
PCRE_EXTRA
This option was invented in order to turn on additional
functionality of PCRE that is incompatible with Perl, but it is
currently of very little use. When set, any backslash in a
pattern that is followed by a letter that has no special meaning
causes an error, thus reserving these combinations for future
expansion. By default, as in Perl, a backslash followed by a
letter with no special meaning is treated as a literal. (Perl
can, however, be persuaded to give an error for this, by running
it with the -w option.) There are at present no other features
controlled by this option. It can also be set by a (?X) option
setting within a pattern.
PCRE_FIRSTLINE
If this option is set, an unanchored pattern is required to match
before or at the first newline in the subject string, though the
matched text may continue over the newline.
PCRE_JAVASCRIPT_COMPAT
If this option is set, PCRE's behaviour is changed in some ways
so that it is compatible with JavaScript rather than Perl. The
changes are as follows:
(1) A lone closing square bracket in a pattern causes a compile-
time error, because this is illegal in JavaScript (by default it
is treated as a data character). Thus, the pattern AB]CD becomes
illegal when this option is set.
(2) At run time, a back reference to an unset subpattern group
matches an empty string (by default this causes the current
matching alternative to fail). A pattern such as (\1)(a) succeeds
when this option is set (assuming it can find an "a" in the
subject), whereas it fails by default, for Perl compatibility.
(3) \U matches an upper case "U" character; by default \U causes
a compile-time error (Perl uses \U to upper case subsequent
characters).
(4) \u matches a lower case "u" character unless it is followed
by four hexadecimal digits, in which case the hexadecimal number
defines the code point to match. By default, \u causes a
compile-time error (Perl uses it to upper case the following
character).
(5) \x matches a lower case "x" character unless it is followed
by two hexadecimal digits, in which case the hexadecimal number
defines the code point to match. By default, as in Perl, a
hexadecimal number is always expected after \x, but it may have
zero, one, or two digits (so, for example, \xz matches a binary
zero character followed by z).
PCRE_MULTILINE
By default, for the purposes of matching "start of line" and "end
of line", PCRE treats the subject string as consisting of a
single line of characters, even if it actually contains newlines.
The "start of line" metacharacter (^) matches only at the start
of the string, and the "end of line" metacharacter ($) matches
only at the end of the string, or before a terminating newline
(except when PCRE_DOLLAR_ENDONLY is set). Note, however, that
unless PCRE_DOTALL is set, the "any character" metacharacter (.)
does not match at a newline. This behaviour (for ^, $, and dot)
is the same as in Perl.
When PCRE_MULTILINE is set, the "start of line" and "end of
line" constructs match immediately following or immediately
before internal newlines in the subject string, respectively, as
well as at the very start and end. This is equivalent to Perl's
/m option, and it can be changed within a pattern by a (?m)
option setting. If there are no newlines in a subject string, or
no occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has
no effect.
PCRE_NEVER_UTF
This option locks out interpretation of the pattern as UTF-8 (or
UTF-16 or UTF-32 in the 16-bit and 32-bit libraries). In
particular, it prevents the creator of the pattern from switching
to UTF interpretation by starting the pattern with (*UTF). This
may be useful in applications that process patterns from external
sources. The combination of PCRE_UTF8 and PCRE_NEVER_UTF also
causes an error.
PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
These options override the default newline definition that was
chosen when PCRE was built. Setting the first or the second
specifies that a newline is indicated by a single character (CR
or LF, respectively). Setting PCRE_NEWLINE_CRLF specifies that a
newline is indicated by the two-character CRLF sequence. Setting
PCRE_NEWLINE_ANYCRLF specifies that any of the three preceding
sequences should be recognized. Setting PCRE_NEWLINE_ANY
specifies that any Unicode newline sequence should be recognized.
In an ASCII/Unicode environment, the Unicode newline sequences
are the three just mentioned, plus the single characters VT
(vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
U+0085), LS (line separator, U+2028), and PS (paragraph
separator, U+2029). For the 8-bit library, the last two are
recognized only in UTF-8 mode.
When PCRE is compiled to run in an EBCDIC (mainframe)
environment, the code for CR is 0x0d, the same as ASCII. However,
the character code for LF is normally 0x15, though in some EBCDIC
environments 0x25 is used. Whichever of these is not LF is made
to correspond to Unicode's NEL character. EBCDIC codes are all
less than 256. For more details, see the pcrebuild documentation.
The newline setting in the options word uses three bits that are
treated as a number, giving eight possibilities. Currently only
six are used (default plus the five values above). This means
that if you set more than one newline option, the combination may
or may not be sensible. For example, PCRE_NEWLINE_CR with
PCRE_NEWLINE_LF is equivalent to PCRE_NEWLINE_CRLF, but other
combinations may yield unused numbers and cause an error.
The only time that a line break in a pattern is specially
recognized when compiling is when PCRE_EXTENDED is set. CR and LF
are white space characters, and so are ignored in this mode.
Also, an unescaped # outside a character class indicates a
comment that lasts until after the next line break sequence. In
other circumstances, line break sequences in patterns are treated
as literal data.
The newline option that is set at compile time becomes the
default that is used for pcre_exec() and pcre_dfa_exec(), but it
can be overridden.
PCRE_NO_AUTO_CAPTURE
If this option is set, it disables the use of numbered capturing
parentheses in the pattern. Any opening parenthesis that is not
followed by ? behaves as if it were followed by ?: but named
parentheses can still be used for capturing (and they acquire
numbers in the usual way). There is no equivalent of this option
in Perl.
PCRE_NO_AUTO_POSSESS
If this option is set, it disables "auto-possessification". This
is an optimization that, for example, turns a+b into a++b in
order to avoid backtracks into a+ that can never be successful.
However, if callouts are in use, auto-possessification means that
some of them are never taken. You can set this option if you want
the matching functions to do a full unoptimized search and run
all the callouts, but it is mainly provided for testing purposes.
PCRE_NO_START_OPTIMIZE
This is an option that acts at matching time; that is, it is
really an option for pcre_exec() or pcre_dfa_exec(). If it is set
at compile time, it is remembered with the compiled pattern and
assumed at matching time. This is necessary if you want to use
JIT execution, because the JIT compiler needs to know whether or
not this option is set. For details see the discussion of
PCRE_NO_START_OPTIMIZE below.
PCRE_UCP
This option changes the way PCRE processes \B, \b, \D, \d, \S,
\s, \W, \w, and some of the POSIX character classes. By default,
only ASCII characters are recognized, but if PCRE_UCP is set,
Unicode properties are used instead to classify characters. More
details are given in the section on generic character types in
the pcrepattern page. If you set PCRE_UCP, matching one of the
items it affects takes much longer. The option is available only
if PCRE has been compiled with Unicode property support.
PCRE_UNGREEDY
This option inverts the "greediness" of the quantifiers so that
they are not greedy by default, but become greedy if followed by
"?". It is not compatible with Perl. It can also be set by a (?U)
option setting within the pattern.
PCRE_UTF8
This option causes PCRE to regard both the pattern and the
subject as strings of UTF-8 characters instead of single-byte
strings. However, it is available only when PCRE is built to
include UTF support. If not, the use of this option provokes an
error. Details of how this option changes the behaviour of PCRE
are given in the pcreunicode page.
PCRE_NO_UTF8_CHECK
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8
string is automatically checked. There is a discussion about the
validity of UTF-8 strings in the pcreunicode page. If an invalid
UTF-8 sequence is found, pcre_compile() returns an error. If you
already know that your pattern is valid, and you want to skip
this check for performance reasons, you can set the
PCRE_NO_UTF8_CHECK option. When it is set, the effect of passing
an invalid UTF-8 string as a pattern is undefined. It may cause
your program to crash or loop. Note that this option can also be
passed to pcre_exec() and pcre_dfa_exec(), to suppress the
validity checking of subject strings only. If the same string is
being matched many times, the option can be safely set for the
second and subsequent matchings to improve performance.