Perl-compatible regular expressions
MATCHING A PATTERN: THE TRADITIONAL FUNCTION
int pcre_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize);
The function pcre_exec()
is called to match a subject string
against a compiled pattern, which is passed in the code argument.
If the pattern was studied, the result of the study should be
passed in the extra argument. You can call pcre_exec()
with the
same code and extra arguments as many times as you like, in order
to match different subject strings with the same pattern.
This function is the main matching facility of the library, and
it operates in a Perl-like manner. For specialist use there is
also an alternative matching function, which is described below
in the section about the pcre_dfa_exec()
function.
In most applications, the pattern will have been compiled (and
optionally studied) in the same process that calls pcre_exec().
However, it is possible to save compiled patterns and study data,
and then use them later in different processes, possibly even on
different hosts. For a discussion about this, see the
pcreprecompile documentation.
Here is an example of a simple call to pcre_exec():
  int rc;
  int ovector[30];
  rc = pcre_exec(
    re,             /* result of pcre_compile() */
    NULL,           /* we didn't study the pattern */
    "some string",  /* the subject string */
    11,             /* the length of the subject string */
    0,              /* start at offset 0 in the subject */
    0,              /* default options */
    ovector,        /* vector of integers for substring information */
    30);            /* number of elements (NOT size in bytes) */
Extra data for pcre_exec()
If the extra argument is not NULL, it must point to a pcre_extra
data block. The pcre_study()
function returns such a block (when
it doesn't return NULL), but you can also create one for
yourself, and pass additional information in it. The pcre_extra
block contains the following fields (not necessarily in this
order):
unsigned long int flags;
void *study_data;
void *executable_jit;
unsigned long int match_limit;
unsigned long int match_limit_recursion;
void *callout_data;
const unsigned char *tables;
unsigned char **mark;
In the 16-bit version of this structure, the mark field has type
"PCRE_UCHAR16 **".
In the 32-bit version of this structure, the mark field has type
"PCRE_UCHAR32 **".
The flags field is used to specify which of the other fields are
set. The flag bits are:
PCRE_EXTRA_CALLOUT_DATA
PCRE_EXTRA_EXECUTABLE_JIT
PCRE_EXTRA_MARK
PCRE_EXTRA_MATCH_LIMIT
PCRE_EXTRA_MATCH_LIMIT_RECURSION
PCRE_EXTRA_STUDY_DATA
PCRE_EXTRA_TABLES
Other flag bits should be set to zero. The study_data field and
sometimes the executable_jit field are set in the pcre_extra
block that is returned by pcre_study(), together with the
appropriate flag bits. You should not set these yourself, but you
may add to the block by setting other fields and their
corresponding flag bits.
The match_limit field provides a means of preventing PCRE from
using up a vast amount of resources when running patterns that
are not going to match, but which have a very large number of
possibilities in their search trees. The classic example is a
pattern that uses nested unlimited repeats.
Internally, pcre_exec() uses a function called match(), which it
calls repeatedly (sometimes recursively). The limit set by
match_limit is imposed on the number of times this function is
called during a match, which has the effect of limiting the
amount of backtracking that can take place. For patterns that are
not anchored, the count restarts from zero for each position in
the subject string.
When pcre_exec()
is called with a pattern that was successfully
studied with a JIT option, the way that the matching is executed
is entirely different. However, there is still the possibility
of runaway matching that goes on for a very long time, and so the
match_limit value is also used in this case (but in a different
way) to limit how long the matching can continue.
The default value for the limit can be set when PCRE is built;
the built-in default is 10 million, which handles all but the most
extreme cases. You can override the default by supplying
pcre_exec() with a pcre_extra block in which match_limit is set,
and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the
limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
A value for the match limit may also be supplied by an item at
the start of a pattern of the form
(*LIMIT_MATCH=d)
where d is a decimal number. However, such a setting is ignored
unless d is less than the limit set by the caller of pcre_exec()
or, if no such limit is set, less than the default.
The match_limit_recursion field is similar to match_limit, but
instead of limiting the total number of times that match()
is
called, it limits the depth of recursion. The recursion depth is
a smaller number than the total number of calls, because not all
calls to match()
are recursive. This limit is of use only if it
is set smaller than match_limit.
Limiting the recursion depth limits the amount of machine stack
that can be used, or, when PCRE has been compiled to use memory
on the heap instead of the stack, the amount of heap memory that
can be used. This limit is not relevant, and is ignored, when
matching is done using JIT compiled code.
The default value for match_limit_recursion can be set when PCRE
is built; the built-in default is the same value as the default
for match_limit. You can override the default by supplying
pcre_exec() with a pcre_extra block in which
match_limit_recursion is set, and
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If
the limit is exceeded, pcre_exec() returns
PCRE_ERROR_RECURSIONLIMIT.
A value for the recursion limit may also be supplied by an item
at the start of a pattern of the form
(*LIMIT_RECURSION=d)
where d is a decimal number. However, such a setting is ignored
unless d is less than the limit set by the caller of pcre_exec()
or, if no such limit is set, less than the default.
The callout_data field is used in conjunction with the "callout"
feature, and is described in the pcrecallout
documentation.
The tables field is provided for use with patterns that have been
pre-compiled using custom character tables, saved to disc or
elsewhere, and then reloaded, because the tables that were used
to compile a pattern are not saved with it. See the
pcreprecompile
documentation for a discussion of saving compiled
patterns for later use. If NULL is passed using this mechanism,
it forces PCRE's internal tables to be used.
Warning:
The tables that pcre_exec()
uses must be the same as
those that were used when the pattern was compiled. If this is
not the case, the behaviour of pcre_exec()
is undefined.
Therefore, when a pattern is compiled and matched in the same
process, this field should never be set. In this (the most
common) case, the correct table pointer is automatically passed
with the compiled pattern from pcre_compile() to pcre_exec().
If PCRE_EXTRA_MARK is set in the flags field, the mark field must
be set to point to a suitable variable. If the pattern contains
any backtracking control verbs such as (*MARK:NAME), and the
execution ends up with a name to pass back, a pointer to the name
string (zero terminated) is placed in the variable pointed to by
the mark field. The names are within the compiled pattern; if you
wish to retain such a name you must copy it before freeing the
memory of a compiled pattern. If there is no name to pass back,
the variable pointed to by the mark field is set to NULL. For
details of the backtracking control verbs, see the section
entitled "Backtracking control" in the pcrepattern
documentation.
Option bits for pcre_exec()
The unused bits of the options argument for pcre_exec()
must be
zero. The only bits that may be set are PCRE_ANCHORED,
PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
PCRE_NOTEMPTY_ATSTART, PCRE_NO_START_OPTIMIZE,
PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT.
If the pattern was successfully studied with one of the
just-in-time (JIT) compile options, the only supported options for
JIT execution are PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL,
PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and
PCRE_PARTIAL_SOFT. If an unsupported option is used, JIT
execution is disabled and the normal interpretive code in
pcre_exec() is run.
PCRE_ANCHORED
The PCRE_ANCHORED option limits pcre_exec() to matching at the
first matching position. If a pattern was compiled with
PCRE_ANCHORED, or turned out to be anchored by virtue of its
contents, it cannot be made unanchored at matching time.
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
These options (which are mutually exclusive) control what the \R
escape sequence matches. The choice is either to match only CR,
LF, or CRLF, or to match any Unicode newline sequence. These
options override the choice that was made or defaulted when the
pattern was compiled.
PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
These options override the newline definition that was chosen or
defaulted when the pattern was compiled. For details, see the
description of pcre_compile()
above. During matching, the newline
choice affects the behaviour of the dot, circumflex, and dollar
metacharacters. It may also alter the way the match position is
advanced after a match failure for an unanchored pattern.
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY
is set, and a match attempt for an unanchored pattern fails when
the current position is at a CRLF sequence, and the pattern
contains no explicit matches for CR or LF characters, the match
position is advanced by two characters instead of one, in other
words, to after the CRLF.
The above rule is a compromise that makes the most common cases
work as expected. For example, if the pattern is .+A (and the
PCRE_DOTALL option is not set), it does not match the string
"\r\nA" because, after failing at the start, it skips both the CR
and the LF before retrying. However, the pattern [\r\n]A does
match that string, because it contains an explicit CR or LF
reference, and so advances only by one character after the first
failure.
An explicit match for CR or LF is either a literal appearance of
one of those characters, or one of the \r or \n escape sequences.
Implicit matches such as [^X] do not count, nor does \s (which
includes CR and LF in the characters that it matches).
Notwithstanding the above, anomalous effects may still occur when
CRLF is a valid newline sequence and explicit \r or \n escapes
appear in the pattern.
PCRE_NOTBOL
This option specifies that the first character of the subject string
is not the beginning of a line, so the circumflex metacharacter
should not match before it. Setting this without PCRE_MULTILINE
(at compile time) causes circumflex never to match. This option
affects only the behaviour of the circumflex metacharacter. It
does not affect \A.
PCRE_NOTEOL
This option specifies that the end of the subject string is not
the end of a line, so the dollar metacharacter should not match
it nor (except in multiline mode) a newline immediately before
it. Setting this without PCRE_MULTILINE (at compile time) causes
dollar never to match. This option affects only the behaviour of
the dollar metacharacter. It does not affect \Z or \z.
PCRE_NOTEMPTY
An empty string is not considered to be a valid match if this
option is set. If there are alternatives in the pattern, they are
tried. If all the alternatives match the empty string, the entire
match fails. For example, if the pattern
a?b?
is applied to a string not beginning with "a" or "b", it matches
an empty string at the start of the subject. With PCRE_NOTEMPTY
set, this match is not valid, so PCRE searches further into the
string for occurrences of "a" or "b".
PCRE_NOTEMPTY_ATSTART
This is like PCRE_NOTEMPTY, except that an empty string match
that is not at the start of the subject is permitted. If the
pattern is anchored, such a match can occur only if the pattern
contains \K.
Perl has no direct equivalent of PCRE_NOTEMPTY or
PCRE_NOTEMPTY_ATSTART, but it does make a special case of a
pattern match of the empty string within its split()
function,
and when using the /g modifier. It is possible to emulate Perl's
behaviour after matching a null string by first trying the match
again at the same offset with PCRE_NOTEMPTY_ATSTART and
PCRE_ANCHORED, and then if that fails, by advancing the starting
offset (see below) and trying an ordinary match again. There is
some code that demonstrates how to do this in the pcredemo
sample
program. In the most general case, you have to check to see if
the newline convention recognizes CRLF as a newline, and if so,
and the current character is CR followed by LF, advance the
starting offset by two characters instead of one.
PCRE_NO_START_OPTIMIZE
There are a number of optimizations that pcre_exec()
uses at the
start of a match, in order to speed up the process. For example,
if it is known that an unanchored match must start with a
specific character, it searches the subject for that character,
and fails immediately if it cannot find it, without actually
running the main matching function. This means that a special
item such as (*COMMIT) at the start of a pattern is not
considered until after a suitable starting point for the match
has been found. Also, when callouts or (*MARK) items are in use,
these "start-up" optimizations can cause them to be skipped if
the pattern is never actually used. The start-up optimizations
are in effect a pre-scan of the subject that takes place before
the pattern is run.
The PCRE_NO_START_OPTIMIZE option disables the start-up
optimizations, possibly causing performance to suffer, but
ensuring that in cases where the result is "no match", the
callouts do occur, and that items such as (*COMMIT) and (*MARK)
are considered at every possible starting position in the subject
string. If PCRE_NO_START_OPTIMIZE is set at compile time, it
cannot be unset at matching time. The use of
PCRE_NO_START_OPTIMIZE at matching time (that is, passing it to
pcre_exec()) disables JIT execution; in this situation, matching
is always done interpretively.
Setting PCRE_NO_START_OPTIMIZE can change the outcome of a
matching operation. Consider the pattern
(*COMMIT)ABC
When this is compiled, PCRE records the fact that a match must
start with the character "A". Suppose the subject string is
"DEFABC". The start-up optimization scans along the subject,
finds "A" and runs the first match attempt from there. The
(*COMMIT) item means that the pattern must match the current
starting position, which in this case, it does. However, if the
same match is run with PCRE_NO_START_OPTIMIZE set, the initial
scan along the subject string does not happen. The first match
attempt is run starting from "D" and when this fails, (*COMMIT)
prevents any further matches being tried, so the overall result
is "no match". If the pattern is studied, more start-up
optimizations may be used. For example, a minimum length for the
subject may be recorded. Consider the pattern
(*MARK:A)(X|Y)
The minimum length for a match is one character. If the subject
is "ABC", there will be attempts to match "ABC", "BC", "C", and
then finally an empty string. If the pattern is studied, the
final attempt does not take place, because PCRE knows that the
subject is too short, and so the (*MARK) is never encountered.
In this case, studying the pattern does not affect the overall
match result, which is still "no match", but it does affect the
auxiliary information that is returned.
PCRE_NO_UTF8_CHECK
When PCRE_UTF8 is set at compile time, the validity of the
subject as a UTF-8 string is automatically checked when
pcre_exec()
is subsequently called. The entire string is checked
before any other processing takes place. The value of startoffset
is also checked to ensure that it points to the start of a UTF-8
character. There is a discussion about the validity of UTF-8
strings in the pcreunicode
page. If an invalid sequence of bytes
is found, pcre_exec()
returns the error PCRE_ERROR_BADUTF8 or, if
PCRE_PARTIAL_HARD is set and the problem is a truncated character
at the end of the subject, PCRE_ERROR_SHORTUTF8. In both cases,
information about the precise nature of the error may also be
returned (see the descriptions of these errors in the section
entitled Error return values from pcre_exec()
below). If
startoffset contains a value that does not point to the start of
a UTF-8 character (or to the end of the subject),
PCRE_ERROR_BADUTF8_OFFSET is returned.
If you already know that your subject is valid, and you want to
skip these checks for performance reasons, you can set the
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might
want to do this for the second and subsequent calls to
pcre_exec() if you are making repeated calls to find all the
matches in a single subject string. However, you should be sure
that the value of startoffset points to the start of a character
(or the end of the subject). When PCRE_NO_UTF8_CHECK is set, the
effect of passing an invalid string as a subject or an invalid
value of startoffset is undefined. Your program may crash or
loop.
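Because an invalid startoffset is undefined behaviour once PCRE_NO_UTF8_CHECK is set, a caller may want to verify the offset itself before each repeated call. The following is a minimal sketch of such a check. The function name is hypothetical, and it assumes the subject has already been validated as UTF-8 (for example by a first call made without PCRE_NO_UTF8_CHECK).

```c
/* Hypothetical helper: returns 1 if `offset` is an acceptable startoffset for
   a subject that is already known to be valid UTF-8, that is, it lies at the
   end of the subject or on a byte that is not a continuation byte (10xxxxxx).
   Returns 0 otherwise, including for out-of-range offsets. */
int is_utf8_start_offset(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;
    return (((unsigned char)subject[offset]) & 0xc0) != 0x80;
}
```

For the 6-byte subject "h\xc3\xa9llo", offsets 0, 1 and 6 are accepted, while offset 2 (the second byte of the two-byte character) is rejected.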
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
These options turn on the partial matching feature. For backwards
compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A
partial match occurs if the end of the subject string is reached
successfully, but there are not enough subject characters to
complete the match. If this happens when PCRE_PARTIAL_SOFT (but
not PCRE_PARTIAL_HARD) is set, matching continues by testing any
remaining alternatives. Only if no complete match can be found is
PCRE_ERROR_PARTIAL returned instead of PCRE_ERROR_NOMATCH. In
other words, PCRE_PARTIAL_SOFT says that the caller is prepared
to handle a partial match, but only if no complete match can be
found.
If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In
this case, if a partial match is found, pcre_exec() immediately
returns PCRE_ERROR_PARTIAL, without considering any other
alternatives. In other words, when PCRE_PARTIAL_HARD is set, a
partial match is considered to be more important than an
alternative complete match.
In both cases, the portion of the string that was inspected when
the partial match was found is set as the first matching string.
There is a more detailed discussion of partial and multi-segment
matching, with examples, in the pcrepartial
documentation.
The string to be matched by pcre_exec()
The subject string is passed to pcre_exec()
as a pointer in
subject, a length in length, and a starting offset in
startoffset. The units for length and startoffset are bytes for
the 8-bit library, 16-bit data items for the 16-bit library, and
32-bit data items for the 32-bit library.
If startoffset is negative or greater than the length of the
subject, pcre_exec()
returns PCRE_ERROR_BADOFFSET. When the
starting offset is zero, the search for a match starts at the
beginning of the subject, and this is by far the most common
case. In UTF-8 or UTF-16 mode, the offset must point to the start
of a character, or the end of the subject (in UTF-32 mode, one
data unit equals one character, so all offsets are valid). Unlike
the pattern string, the subject may contain binary zeroes.
A non-zero starting offset is useful when searching for another
match in the same subject by calling pcre_exec()
again after a
previous success. Setting startoffset differs from just passing
over a shortened string and setting PCRE_NOTBOL in the case of a
pattern that begins with any kind of lookbehind. For example,
consider the pattern
\Biss\B
which finds occurrences of "iss" in the middle of words. (\B
matches only if the current position in the subject is not a word
boundary.) When applied to the string "Mississippi" the first
call to pcre_exec()
finds the first occurrence. If pcre_exec()
is
called again with just the remainder of the subject, namely
"issippi", it does not match, because \B is always false at the
start of the subject, which is deemed to be a word boundary.
However, if pcre_exec()
is passed the entire string again, but
with startoffset set to 4, it finds the second occurrence of
"iss" because it is able to look behind the starting point to
discover that it is preceded by a letter.
Finding all the matches in a subject is tricky when the pattern
can match an empty string. It is possible to emulate Perl's /g
behaviour by first trying the match again at the same offset,
with the PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and
then if that fails, advancing the starting offset and trying an
ordinary match again. There is some code that demonstrates how to
do this in the pcredemo
sample program. In the most general case,
you have to check to see if the newline convention recognizes
CRLF as a newline, and if so, and the current character is CR
followed by LF, advance the starting offset by two characters
instead of one.
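The offset-advancing step described above might look like the following sketch. The function name and the crlf_is_newline parameter are hypothetical: the caller is assumed to have already worked out whether the newline convention in force recognizes CRLF. The sketch works in data units and does not skip over the trailing bytes of a multi-byte UTF-8 character, which a UTF-8 caller would also need to do.

```c
/* Hypothetical helper: after an empty-string match at `offset` and a failed
   retry with PCRE_NOTEMPTY_ATSTART | PCRE_ANCHORED, compute the offset at
   which the next ordinary match attempt should start. */
int next_start_offset(const char *subject, int length, int offset,
                      int crlf_is_newline)
{
    int next = offset + 1;

    /* Step over a whole CRLF pair when CRLF counts as a newline. */
    if (crlf_is_newline && offset < length - 1 &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        next += 1;

    return next;
}
```

The caller is expected to stop the loop before calling this helper once the empty match was found at the very end of the subject.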
If a non-zero starting offset is passed when the pattern is
anchored, one attempt to match at the given offset is made. This
can only succeed if the pattern does not require the match to be
at the start of the subject.
How pcre_exec() returns captured substrings
In general, a pattern matches a certain portion of the subject,
and in addition, further substrings from the subject may be
picked out by parts of the pattern. Following the usage in
Jeffrey Friedl's book, this is called "capturing" in what
follows, and the phrase "capturing subpattern" is used for a
fragment of a pattern that picks out a substring. PCRE supports
several other kinds of parenthesized subpattern that do not cause
substrings to be captured.
Captured substrings are returned to the caller via a vector of
integers whose address is passed in ovector. The number of
elements in the vector is passed in ovecsize, which must be a
non-negative number. Note: this argument is NOT the size of
ovector in bytes.
The first two-thirds of the vector is used to pass back captured
substrings, each substring using a pair of integers. The
remaining third of the vector is used as workspace by pcre_exec()
while matching capturing subpatterns, and is not available for
passing back information. The number passed in ovecsize should
always be a multiple of three. If it is not, it is rounded down.
When a match is successful, information about captured substrings
is returned in pairs of integers, starting at the beginning of
ovector, and continuing up to two-thirds of its length at the
most. The first element of each pair is set to the offset of the
first character in a substring, and the second is set to the
offset of the first character after the end of a substring. These
values are always data unit offsets, even in UTF mode. They are
byte offsets in the 8-bit library, 16-bit data item offsets in
the 16-bit library, and 32-bit data item offsets in the 32-bit
library. Note: they are not character counts.
The first pair of integers, ovector[0] and ovector[1], identify
the portion of the subject string matched by the entire pattern.
The next pair is used for the first capturing subpattern, and so
on. The value returned by pcre_exec()
is one more than the
highest numbered pair that has been set. For example, if two
substrings have been captured, the returned value is 3. If there
are no capturing subpatterns, the return value from a successful
match is 1, indicating that just the first pair of offsets has
been set.
If a capturing subpattern is matched repeatedly, it is the last
portion of the string that it matched that is returned.
If the vector is too small to hold all the captured substring
offsets, it is used as far as possible (up to two-thirds of its
length), and the function returns a value of zero. If neither the
actual string matched nor any captured substrings are of
interest, pcre_exec()
may be called with ovector passed as NULL
and ovecsize as zero. However, if the pattern contains back
references and the ovector is not big enough to remember the
related substrings, PCRE has to get additional memory for use
during matching. Thus it is usually advisable to supply an
ovector of reasonable size.
There are some cases where zero is returned (indicating vector
overflow) when in fact the vector is exactly the right size for
the final match. For example, consider the pattern
(a)(?:(b)c|bd)
If a vector of 6 elements (allowing for only 1 captured
substring) is given with subject string "abd", pcre_exec()
will
try to set the second captured string, thereby recording a vector
overflow, before failing to match "c" and backing up to try the
second alternative. The zero return, however, does correctly
indicate that the maximum number of slots (namely 2) have been
filled. In similar cases where there is temporary overflow, but
the final number of used slots is actually less than the maximum,
a non-zero value is returned.
The pcre_fullinfo()
function can be used to find out how many
capturing subpatterns there are in a compiled pattern. The
smallest size for ovector that will allow for n captured
substrings, in addition to the offsets of the substring matched
by the whole pattern, is (n+1)*3.
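The sizing rules above reduce to simple arithmetic, sketched here with two hypothetical helper names. In real code the capture count would normally come from pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT; here it is simply a parameter.

```c
/* Smallest ovecsize that leaves room for the whole-pattern pair plus
   `capture_count` captured substrings: (n+1)*3. */
int ovector_size_for(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Number of offset pairs pcre_exec() can report for a given ovecsize.
   The size is rounded down to a multiple of three, and only the first
   two-thirds holds pairs, so the result is ovecsize / 3. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For instance, a 30-element vector can report the overall match plus nine captured substrings, and a 6-element vector has only 2 slots, as in the (a)(?:(b)c|bd) example.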
It is possible for capturing subpattern number n+1 to match some
part of the subject when subpattern n has not been used at all.
For example, if the string "abc" is matched against the pattern
(a|(z))(bc) the return from the function is 4, and subpatterns 1
and 3 are matched, but 2 is not. When this happens, both values
in the offset pairs corresponding to unused subpatterns are set
to -1.
Offset values that correspond to unused subpatterns at the end of
the expression are also set to -1. For example, if the string
"abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2
and 3 are not matched. The return from the function is 2, because
the highest used capturing subpattern number is 1, and the
offsets for the second and third capturing subpatterns
(assuming the vector is large enough, of course) are set to -1.
Note: Elements in the first two-thirds of ovector that do not
correspond to capturing parentheses in the pattern are never
changed. That is, if a pattern contains n capturing parentheses,
no more than ovector[0] to ovector[2n+1] are set by pcre_exec().
The other elements (in the first two-thirds) retain whatever
values they previously had.
Some convenience functions are provided for extracting the
captured substrings as separate strings. These are described
below.
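To make the layout of the offset pairs concrete, here is a small sketch that reads them directly. The names are hypothetical, and the array abc_ovector simply hard-codes the offsets that the text above describes for "abc" matched against (a|(z))(bc), where the return value is 4. The library's own convenience functions are the usual way to extract substrings; this only illustrates how the pairs and the -1 markers are interpreted.

```c
/* Offsets as described in the text for "abc" against (a|(z))(bc):
   pair 0 = whole match, pair 1 = "a", pair 2 = unset, pair 3 = "bc". */
static const int abc_ovector[8] = { 0, 3, 0, 1, -1, -1, 1, 3 };

/* Hypothetical helper: length in data units of captured substring `n`,
   given the ovector and the positive return value `rc` of pcre_exec().
   Returns -1 if the pair was not set or `n` is out of range. */
int capture_length(const int *ovector, int rc, int n)
{
    int start, end;

    if (n < 0 || n >= rc)
        return -1;              /* beyond the highest pair that was set */
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start < 0 || end < start)
        return -1;              /* unused subpattern: both values are -1 */
    return end - start;
}
```

A caller could then print substring n with printf("%.*s", capture_length(ovector, rc, n), subject + ovector[2 * n]) after checking that the length is not negative.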
Error return values from pcre_exec()
If pcre_exec()
fails, it returns a negative number. The following
are defined in the header file:
PCRE_ERROR_NOMATCH (-1)
The subject string did not match the pattern.
PCRE_ERROR_NULL (-2)
Either code or subject was passed as NULL, or ovector was NULL
and ovecsize was not zero.
PCRE_ERROR_BADOPTION (-3)
An unrecognized bit was set in the options argument.
PCRE_ERROR_BADMAGIC (-4)
PCRE stores a 4-byte "magic number" at the start of the compiled
code, to catch the case when it is passed a junk pointer and to
detect when a pattern that was compiled in an environment of one
endianness is run in an environment with the other endianness.
This is the error that PCRE gives when the magic number is not
present.
PCRE_ERROR_UNKNOWN_OPCODE (-5)
While running the pattern match, an unknown item was encountered
in the compiled pattern. This error could be caused by a bug in
PCRE or by overwriting of the compiled pattern.
PCRE_ERROR_NOMEMORY (-6)
If a pattern contains back references, but the ovector that is
passed to pcre_exec()
is not big enough to remember the
referenced substrings, PCRE gets a block of memory at the start
of matching to use for this purpose. If the call via
pcre_malloc()
fails, this error is given. The memory is
automatically freed at the end of matching.
This error is also given if pcre_stack_malloc() fails in
pcre_exec(). This can happen only when PCRE has been compiled
with --disable-stack-for-recursion.
PCRE_ERROR_NOSUBSTRING (-7)
This error is used by the pcre_copy_substring(),
pcre_get_substring(), and pcre_get_substring_list() functions
(see below). It is never returned by pcre_exec().
PCRE_ERROR_MATCHLIMIT (-8)
The backtracking limit, as specified by the match_limit field in
a pcre_extra
structure (or defaulted) was reached. See the
description above.
PCRE_ERROR_CALLOUT (-9)
This error is never generated by pcre_exec()
itself. It is
provided for use by callout functions that want to yield a
distinctive error code. See the pcrecallout
documentation for
details.
PCRE_ERROR_BADUTF8 (-10)
A string that contains an invalid UTF-8 byte sequence was passed
as a subject, and the PCRE_NO_UTF8_CHECK option was not set. If
the size of the output vector (ovecsize) is at least 2, the byte
offset to the start of the invalid UTF-8 character is placed
in the first element, and a reason code is placed in the second
element. The reason codes are listed in the following section.
For backward compatibility, if PCRE_PARTIAL_HARD is set and the
problem is a truncated UTF-8 character at the end of the subject
(reason codes 1 to 5), PCRE_ERROR_SHORTUTF8 is returned instead
of PCRE_ERROR_BADUTF8.
PCRE_ERROR_BADUTF8_OFFSET (-11)
The UTF-8 byte sequence that was passed as a subject was checked
and found to be valid (the PCRE_NO_UTF8_CHECK option was not
set), but the value of startoffset did not point to the beginning
of a UTF-8 character or the end of the subject.
PCRE_ERROR_PARTIAL (-12)
The subject string did not match, but it did match partially. See
the pcrepartial
documentation for details of partial matching.
PCRE_ERROR_BADPARTIAL (-13)
This code is no longer in use. It was formerly returned when the
PCRE_PARTIAL option was used with a compiled pattern containing
items that were not supported for partial matching. From release
8.00 onwards, there are no restrictions on partial matching.
PCRE_ERROR_INTERNAL (-14)
An unexpected internal error has occurred. This error could be
caused by a bug in PCRE or by overwriting of the compiled
pattern.
PCRE_ERROR_BADCOUNT (-15)
This error is given if the value of the ovecsize argument is
negative.
PCRE_ERROR_RECURSIONLIMIT (-21)
The internal recursion limit, as specified by the
match_limit_recursion field in a pcre_extra
structure (or
defaulted) was reached. See the description above.
PCRE_ERROR_BADNEWLINE (-23)
An invalid combination of PCRE_NEWLINE_xxx options was given.
PCRE_ERROR_BADOFFSET (-24)
The value of startoffset was negative or greater than the length
of the subject, that is, the value in length.
PCRE_ERROR_SHORTUTF8 (-25)
This error is returned instead of PCRE_ERROR_BADUTF8 when the
subject string ends with a truncated UTF-8 character and the
PCRE_PARTIAL_HARD option is set. Information about the failure
is returned as for PCRE_ERROR_BADUTF8. It is in fact sufficient
to detect this case, but this special error code for
PCRE_PARTIAL_HARD precedes the implementation of returned
information; it is retained for backwards compatibility.
PCRE_ERROR_RECURSELOOP (-26)
This error is returned when pcre_exec()
detects a recursion loop
within the pattern. Specifically, it means that either the whole
pattern or a subpattern has been called recursively for the
second time at the same position in the subject string. Some
simple patterns that might do this are detected and faulted at
compile time, but more complicated cases, in particular mutual
recursions between two different subpatterns, cannot be detected
until run time.
PCRE_ERROR_JIT_STACKLIMIT (-27)
This error is returned when a pattern that was successfully
studied using a JIT compile option is being matched, but the
memory available for the just-in-time processing stack is not
large enough. See the pcrejit
documentation for more details.
PCRE_ERROR_BADMODE (-28)
This error is given if a pattern that was compiled by the 8-bit
library is passed to a 16-bit or 32-bit library function, or vice
versa.
PCRE_ERROR_BADENDIANNESS (-29)
This error is given if a pattern that was compiled and saved is
reloaded on a host with different endianness. The utility
function pcre_pattern_to_host_byte_order()
can be used to convert
such a pattern so that it runs on the new host.
PCRE_ERROR_JIT_BADOPTION (-31)
This error is returned when a pattern that was successfully
studied using a JIT compile option is being matched, but the
matching mode (partial or complete match) does not correspond to
any JIT compilation mode. When the JIT fast path function is
used, this error may be also given for invalid options. See the
pcrejit
documentation for more details.
PCRE_ERROR_BADLENGTH (-32)
This error is given if pcre_exec()
is called with a negative
value for the length argument.
Error numbers -16 to -20, -22, and -30 are not used by
pcre_exec().
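A caller usually needs to sort these return values into a handful of cases. The following is one possible grouping, offered as a sketch. The enum and function names are hypothetical, and the numeric values are copied from the list above so that the block stands alone. Real code would include pcre.h and use the PCRE_ERROR_xxx names instead of literals.

```c
/* Hypothetical grouping of pcre_exec() return values. */
enum exec_outcome {
    OUTCOME_MATCH,         /* positive: match, all set pairs fit in ovector */
    OUTCOME_OVECTOR_FULL,  /* zero: match, but ovector was too small */
    OUTCOME_NO_MATCH,
    OUTCOME_PARTIAL,
    OUTCOME_LIMIT,         /* a resource limit was reached */
    OUTCOME_BAD_UTF8,      /* invalid subject or startoffset in UTF-8 mode */
    OUTCOME_ERROR          /* anything else: bad arguments or internal error */
};

enum exec_outcome classify_exec_result(int rc)
{
    if (rc > 0)
        return OUTCOME_MATCH;
    switch (rc) {
    case 0:   return OUTCOME_OVECTOR_FULL;
    case -1:  return OUTCOME_NO_MATCH;   /* PCRE_ERROR_NOMATCH */
    case -12: return OUTCOME_PARTIAL;    /* PCRE_ERROR_PARTIAL */
    case -8:                             /* PCRE_ERROR_MATCHLIMIT */
    case -21:                            /* PCRE_ERROR_RECURSIONLIMIT */
    case -27: return OUTCOME_LIMIT;      /* PCRE_ERROR_JIT_STACKLIMIT */
    case -10:                            /* PCRE_ERROR_BADUTF8 */
    case -11:                            /* PCRE_ERROR_BADUTF8_OFFSET */
    case -25: return OUTCOME_BAD_UTF8;   /* PCRE_ERROR_SHORTUTF8 */
    default:  return OUTCOME_ERROR;
    }
}
```

Which codes belong together is a design choice for the application. For example, a program doing multi-segment matching might treat PCRE_ERROR_SHORTUTF8 like a partial match, not like a bad subject.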
Reason codes for invalid UTF-8 strings
This section applies only to the 8-bit library. The corresponding
information for the 16-bit and 32-bit libraries is given in the
pcre16
and pcre32
pages.
When pcre_exec()
returns either PCRE_ERROR_BADUTF8 or
PCRE_ERROR_SHORTUTF8, and the size of the output vector
(ovecsize) is at least 2, the offset of the start of the invalid
UTF-8 character is placed in the first output vector element
(ovector[0]) and a reason code is placed in the second element
(ovector[1]). The reason codes are given names in the pcre.h
header file:
PCRE_UTF8_ERR1
PCRE_UTF8_ERR2
PCRE_UTF8_ERR3
PCRE_UTF8_ERR4
PCRE_UTF8_ERR5
The string ends with a truncated UTF-8 character; the code
specifies how many bytes are missing (1 to 5). Although RFC 3629
restricts UTF-8 characters to be no longer than 4 bytes, the
encoding scheme (originally defined by RFC 2279) allows for up to
6 bytes, and this is checked first; hence the possibility of 4 or
5 missing bytes.
PCRE_UTF8_ERR6
PCRE_UTF8_ERR7
PCRE_UTF8_ERR8
PCRE_UTF8_ERR9
PCRE_UTF8_ERR10
The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th
byte of the character do not have the binary value 0b10 (that is,
either the most significant bit is 0, or the next bit is 1).
PCRE_UTF8_ERR11
PCRE_UTF8_ERR12
A character that is valid by the RFC 2279 rules is either 5 or 6
bytes long; these code points are excluded by RFC 3629.
PCRE_UTF8_ERR13
A 4-byte character has a value greater than 0x10ffff; these code
points are excluded by RFC 3629.
PCRE_UTF8_ERR14
A 3-byte character has a value in the range 0xd800 to 0xdfff;
this range of code points is reserved by RFC 3629 for use with
UTF-16, and so is excluded from UTF-8.
PCRE_UTF8_ERR15
PCRE_UTF8_ERR16
PCRE_UTF8_ERR17
PCRE_UTF8_ERR18
PCRE_UTF8_ERR19
A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it
codes for a value that can be represented by fewer bytes, which
is invalid. For example, the two bytes 0xc0, 0xae give the value
0x2e, whose correct coding uses just one byte.
PCRE_UTF8_ERR20
The two most significant bits of the first byte of a character
have the binary value 0b10 (that is, the most significant bit is
1 and the second is 0). Such a byte can only validly occur as the
second or subsequent byte of a multi-byte character.
PCRE_UTF8_ERR21
The first byte of a character has the value 0xfe or 0xff. These
values can never occur in a valid UTF-8 string.
PCRE_UTF8_ERR22
This error code was formerly used when the presence of a
so-called "non-character" caused an error. Unicode corrigendum #9
makes it clear that such characters should not cause a string to
be rejected, and so this code is no longer in use and is never
returned.
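The reason codes fall into the groups listed above, which the following sketch turns into a lookup on the value found in ovector[1]. The enum and function names are hypothetical. The sketch assumes that PCRE_UTF8_ERRn has the numeric value n, which is consistent with the reference to "reason codes 1 to 5" earlier in this document. Real code would compare against the names from pcre.h.

```c
/* Hypothetical grouping of the UTF-8 reason codes. */
enum utf8_fault {
    UTF8_TRUNCATED,           /* 1 to 5: bytes missing at end of subject */
    UTF8_BAD_CONTINUATION,    /* 6 to 10: later byte is not 10xxxxxx */
    UTF8_TOO_LONG,            /* 11, 12: 5- or 6-byte character */
    UTF8_OUT_OF_RANGE,        /* 13: value above 0x10ffff */
    UTF8_SURROGATE,           /* 14: value in 0xd800 to 0xdfff */
    UTF8_OVERLONG,            /* 15 to 19: more bytes than necessary */
    UTF8_STRAY_CONTINUATION,  /* 20: first byte is 10xxxxxx */
    UTF8_INVALID_LEAD,        /* 21: first byte is 0xfe or 0xff */
    UTF8_UNKNOWN              /* 22 (no longer returned) and anything else */
};

enum utf8_fault classify_utf8_reason(int reason)
{
    if (reason >= 1 && reason <= 5)   return UTF8_TRUNCATED;
    if (reason >= 6 && reason <= 10)  return UTF8_BAD_CONTINUATION;
    if (reason == 11 || reason == 12) return UTF8_TOO_LONG;
    if (reason == 13)                 return UTF8_OUT_OF_RANGE;
    if (reason == 14)                 return UTF8_SURROGATE;
    if (reason >= 15 && reason <= 19) return UTF8_OVERLONG;
    if (reason == 20)                 return UTF8_STRAY_CONTINUATION;
    if (reason == 21)                 return UTF8_INVALID_LEAD;
    return UTF8_UNKNOWN;
}

/* For a truncated character the code itself is the number of missing
   bytes. Returns 0 for every other reason code. */
int utf8_missing_bytes(int reason)
{
    return (reason >= 1 && reason <= 5) ? reason : 0;
}
```

A streaming caller might use utf8_missing_bytes() to decide how much more input to wait for before retrying the match.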