`pcreapi` ( 3 )

Perl-compatible regular expressions

COMPILING A PATTERN

Synopsis

pcre *pcre_compile(const char *pattern, int options, const char **errptr, int *erroffset, const unsigned char *tableptr);

pcre *pcre_compile2(const char *pattern, int options, int *errorcodeptr, const char **errptr, int *erroffset, const unsigned char *tableptr);

Either of the functions pcre_compile() or pcre_compile2() can be called to compile a pattern into an internal form. The only difference between the two interfaces is that pcre_compile2() has an additional argument, errorcodeptr, via which a numerical error code can be returned. To avoid too much repetition, we refer just to pcre_compile() below, but the information applies equally to pcre_compile2().

The pattern is a C string terminated by a binary zero, and is passed in the pattern argument. A pointer to a single block of memory that is obtained via pcre_malloc is returned. This contains the compiled code and related data. The pcre type is defined for the returned block; this is a typedef for a structure whose contents are not externally defined. It is up to the caller to free the memory (via pcre_free) when it is no longer required.

Although the compiled code of a PCRE regex is relocatable, that is, it does not depend on memory location, the complete pcre data block is not fully relocatable, because it may contain a copy of the tableptr argument, which is an address (see below).

The options argument contains various bit settings that affect the compilation. It should be zero if no options are required. The available options are described below. Some of them (in particular, those that are compatible with Perl, but some others as well) can also be set and unset from within the pattern (see the detailed description in the pcrepattern documentation). For those options that can be different in different parts of the pattern, the contents of the options argument specify their settings at the start of compilation and execution. The PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and PCRE_NO_START_OPTIMIZE options can be set at the time of matching as well as at compile time.

If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, if compilation of a pattern fails, pcre_compile() returns NULL, and sets the variable pointed to by errptr to point to a textual error message. This is a static string that is part of the library. You must not try to free it. Normally, the offset from the start of the pattern to the data unit that was being processed when the error was discovered is placed in the variable pointed to by erroffset, which must not be NULL (if it is, an immediate error is given). However, for an invalid UTF-8 or UTF-16 string, the offset is that of the first data unit of the failing character.

Some errors are not detected until the whole pattern has been scanned; in these cases, the offset passed back is the length of the pattern. Note that the offset is in data units, not characters, even in a UTF mode. It may sometimes point into the middle of a UTF-8 or UTF-16 character.

If pcre_compile2() is used instead of pcre_compile(), and the errorcodeptr argument is not NULL, a non-zero error code number is returned via this argument in the event of an error. This is in addition to the textual error message. Error codes and messages are listed below.

If the final argument, tableptr, is NULL, PCRE uses a default set of character tables that are built when PCRE is compiled, using the default C locale. Otherwise, tableptr must be an address that is the result of a call to pcre_maketables(). This value is stored with the compiled pattern, and used again by pcre_exec() and pcre_dfa_exec() when the pattern is matched. For more discussion, see the section on locale support below.

This code fragment shows a typical straightforward call to pcre_compile():

  pcre *re;
  const char *error;
  int erroffset;
  re = pcre_compile(
    "^A.*Z",          /* the pattern */
    0,                /* default options */
    &error,           /* for error message */
    &erroffset,       /* for error offset */
    NULL);            /* use default character tables */

The following names for option bits are defined in the pcre.h header file:

PCRE_ANCHORED

If this bit is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the first matching point in the string that is being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.

PCRE_AUTO_CALLOUT

If this bit is set, pcre_compile() automatically inserts callout items, all with number 255, before each pattern item. For discussion of the callout facility, see the pcrecallout documentation.

PCRE_BSR_ANYCRLF PCRE_BSR_UNICODE

These options (which are mutually exclusive) control what the \R escape sequence matches. The choice is either to match only CR, LF, or CRLF, or to match any Unicode newline sequence. The default is specified when PCRE is built. It can be overridden from within the pattern, or by setting an option when a compiled pattern is matched.

PCRE_CASELESS

If this bit is set, letters in the pattern match both upper and lower case letters. It is equivalent to Perl's /i option, and it can be changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE always understands the concept of case for characters whose values are less than 128, so caseless matching is always possible. For characters with higher values, the concept of case is supported if PCRE is compiled with Unicode property support, but not otherwise. If you want to use caseless matching for characters 128 and above, you must ensure that PCRE is compiled with Unicode property support as well as with UTF-8 support.

PCRE_DOLLAR_ENDONLY

If this bit is set, a dollar metacharacter in the pattern matches only at the end of the subject string. Without this option, a dollar also matches immediately before a newline at the end of the string (but not before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set. There is no equivalent to this option in Perl, and no way to set it within a pattern.

PCRE_DOTALL

If this bit is set, a dot metacharacter in the pattern matches a character of any value, including one that indicates a newline. However, it only ever matches one character, even if newlines are coded as CRLF. Without this option, a dot does not match when the current position is at a newline. This option is equivalent to Perl's /s option, and it can be changed within a pattern by a (?s) option setting. A negative class such as [^a] always matches newline characters, independent of the setting of this option.

PCRE_DUPNAMES

If this bit is set, names used to identify capturing subpatterns need not be unique. This can be helpful for certain types of pattern when it is known that only one instance of the named subpattern can ever be matched. There are more details of named subpatterns below; see also the pcrepattern documentation.

PCRE_EXTENDED

If this bit is set, most white space characters in the pattern are totally ignored except when escaped or inside a character class. White space is not allowed within sequences such as (?> that introduce various parenthesized subpatterns, nor within a numerical quantifier such as {1,3}. Ignorable white space is, however, permitted between an item and a following quantifier and between a quantifier and a following + that indicates possessiveness.

White space formerly did not include the VT character (code 11), because Perl did not treat this character as white space. However, Perl changed at release 5.18, so PCRE followed at release 8.34, and VT is now treated as white space.

PCRE_EXTENDED also causes characters between an unescaped # outside a character class and the next newline, inclusive, to be ignored. PCRE_EXTENDED is equivalent to Perl's /x option, and it can be changed within a pattern by a (?x) option setting.

Which characters are interpreted as newlines is controlled by the options passed to pcre_compile() or by a special sequence at the start of the pattern, as described in the section entitled "Newline conventions" in the pcrepattern documentation. Note that the end of this type of comment is a literal newline sequence in the pattern; escape sequences that happen to represent a newline do not count.

This option makes it possible to include comments inside complicated patterns. Note, however, that this applies only to data characters. White space characters may never appear within special character sequences in a pattern, for example within the sequence (?( that introduces a conditional subpattern.

PCRE_EXTRA

This option was invented in order to turn on additional functionality of PCRE that is incompatible with Perl, but it is currently of very little use. When set, any backslash in a pattern that is followed by a letter that has no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as a literal. (Perl can, however, be persuaded to give an error for this, by running it with the -w option.) There are at present no other features controlled by this option. It can also be set by a (?X) option setting within a pattern.

PCRE_FIRSTLINE

If this option is set, an unanchored pattern is required to match before or at the first newline in the subject string, though the matched text may continue over the newline.

PCRE_JAVASCRIPT_COMPAT

If this option is set, PCRE's behaviour is changed in some ways so that it is compatible with JavaScript rather than Perl. The changes are as follows:

(1) A lone closing square bracket in a pattern causes a compile-time error, because this is illegal in JavaScript (by default it is treated as a data character). Thus, the pattern AB]CD becomes illegal when this option is set.

(2) At run time, a back reference to an unset subpattern group matches an empty string (by default this causes the current matching alternative to fail). A pattern such as (\1)(a) succeeds when this option is set (assuming it can find an "a" in the subject), whereas it fails by default, for Perl compatibility.

(3) \U matches an upper case "U" character; by default \U causes a compile time error (Perl uses \U to upper case subsequent characters).

(4) \u matches a lower case "u" character unless it is followed by four hexadecimal digits, in which case the hexadecimal number defines the code point to match. By default, \u causes a compile time error (Perl uses it to upper case the following character).

(5) \x matches a lower case "x" character unless it is followed by two hexadecimal digits, in which case the hexadecimal number defines the code point to match. By default, as in Perl, a hexadecimal number is always expected after \x, but it may have zero, one, or two digits (so, for example, \xz matches a binary zero character followed by z).

PCRE_MULTILINE

By default, for the purposes of matching "start of line" and "end of line", PCRE treats the subject string as consisting of a single line of characters, even if it actually contains newlines. The "start of line" metacharacter (^) matches only at the start of the string, and the "end of line" metacharacter ($) matches only at the end of the string, or before a terminating newline (except when PCRE_DOLLAR_ENDONLY is set). Note, however, that unless PCRE_DOTALL is set, the "any character" metacharacter (.) does not match at a newline. This behaviour (for ^, $, and dot) is the same as Perl.

When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs match immediately following or immediately before internal newlines in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl's /m option, and it can be changed within a pattern by a (?m) option setting. If there are no newlines in a subject string, or no occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no effect.

PCRE_NEVER_UTF

This option locks out interpretation of the pattern as UTF-8 (or UTF-16 or UTF-32 in the 16-bit and 32-bit libraries). In particular, it prevents the creator of the pattern from switching to UTF interpretation by starting the pattern with (*UTF). This may be useful in applications that process patterns from external sources. The combination of PCRE_UTF8 and PCRE_NEVER_UTF also causes an error.

PCRE_NEWLINE_CR PCRE_NEWLINE_LF PCRE_NEWLINE_CRLF PCRE_NEWLINE_ANYCRLF PCRE_NEWLINE_ANY

These options override the default newline definition that was chosen when PCRE was built. Setting the first or the second specifies that a newline is indicated by a single character (CR or LF, respectively). Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies that any of the three preceding sequences should be recognized. Setting PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be recognized.

In an ASCII/Unicode environment, the Unicode newline sequences are the three just mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS (paragraph separator, U+2029). For the 8-bit library, the last two are recognized only in UTF-8 mode.

When PCRE is compiled to run in an EBCDIC (mainframe) environment, the code for CR is 0x0d, the same as ASCII. However, the character code for LF is normally 0x15, though in some EBCDIC environments 0x25 is used. Whichever of these is not LF is made to correspond to Unicode's NEL character. EBCDIC codes are all less than 256. For more details, see the pcrebuild documentation.

The newline setting in the options word uses three bits that are treated as a number, giving eight possibilities. Currently only six are used (default plus the five values above). This means that if you set more than one newline option, the combination may or may not be sensible. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and cause an error.

The only time that a line break in a pattern is specially recognized when compiling is when PCRE_EXTENDED is set. CR and LF are white space characters, and so are ignored in this mode. Also, an unescaped # outside a character class indicates a comment that lasts until after the next line break sequence. In other circumstances, line break sequences in patterns are treated as literal data.

The newline option that is set at compile time becomes the default that is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.

PCRE_NO_AUTO_CAPTURE

If this option is set, it disables the use of numbered capturing parentheses in the pattern. Any opening parenthesis that is not followed by ? behaves as if it were followed by ?: but named parentheses can still be used for capturing (and they acquire numbers in the usual way). There is no equivalent of this option in Perl.

PCRE_NO_AUTO_POSSESS

If this option is set, it disables "auto-possessification". This is an optimization that, for example, turns a+b into a++b in order to avoid backtracks into a+ that can never be successful. However, if callouts are in use, auto-possessification means that some of them are never taken. You can set this option if you want the matching functions to do a full unoptimized search and run all the callouts, but it is mainly provided for testing purposes.

PCRE_NO_START_OPTIMIZE

This is an option that acts at matching time; that is, it is really an option for pcre_exec() or pcre_dfa_exec(). If it is set at compile time, it is remembered with the compiled pattern and assumed at matching time. This is necessary if you want to use JIT execution, because the JIT compiler needs to know whether or not this option is set. For details see the discussion of PCRE_NO_START_OPTIMIZE below.

PCRE_UCP

This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By default, only ASCII characters are recognized, but if PCRE_UCP is set, Unicode properties are used instead to classify characters. More details are given in the section on generic character types in the pcrepattern page. If you set PCRE_UCP, matching one of the items it affects takes much longer. The option is available only if PCRE has been compiled with Unicode property support.

PCRE_UNGREEDY

This option inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by "?". It is not compatible with Perl. It can also be set by a (?U) option setting within the pattern.

PCRE_UTF8

This option causes PCRE to regard both the pattern and the subject as strings of UTF-8 characters instead of single-byte strings. However, it is available only when PCRE is built to include UTF support. If not, the use of this option provokes an error. Details of how this option changes the behaviour of PCRE are given in the pcreunicode page.

PCRE_NO_UTF8_CHECK

When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is automatically checked. There is a discussion about the validity of UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence is found, pcre_compile() returns an error. If you already know that your pattern is valid, and you want to skip this check for performance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of passing an invalid UTF-8 string as a pattern is undefined. It may cause your program to crash or loop. Note that this option can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the validity checking of subject strings only. If the same string is being matched many times, the option can be safely set for the second and subsequent matchings to improve performance.

Source text at man7.org
