Guide to the Linux Manual




pcreapi(3)

Perl-compatible regular expressions


INFORMATION ABOUT A PATTERN

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
            int what, void *where);

The pcre_fullinfo() function returns information about a compiled pattern. It replaces the pcre_info() function, which was removed from the library at version 8.30, after more than 10 years of obsolescence.

The first argument for pcre_fullinfo() is a pointer to the compiled pattern. The second argument is the result of pcre_study(), or NULL if the pattern was not studied. The third argument specifies which piece of information is required, and the fourth argument is a pointer to a variable to receive the data. The yield of the function is zero for success, or one of the following negative numbers:

  PCRE_ERROR_NULL           the argument code was NULL
                            the argument where was NULL
  PCRE_ERROR_BADMAGIC       the "magic number" was not found
  PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different endianness
  PCRE_ERROR_BADOPTION      the value of what was invalid
  PCRE_ERROR_UNSET          the requested field is not set

The "magic number" is placed at the start of each compiled pattern as a simple check against passing an arbitrary memory pointer. The endianness error can occur if a compiled pattern is saved and reloaded on a different host. Here is a typical call of pcre_fullinfo(), to obtain the length of the compiled pattern:

  int rc;
  size_t length;
  rc = pcre_fullinfo(
    re,               /* result of pcre_compile() */
    sd,               /* result of pcre_study(), or NULL */
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */

The possible values for the third argument are defined in pcre.h, and are as follows:

PCRE_INFO_BACKREFMAX

Return the number of the highest back reference in the pattern. The fourth argument should point to an int variable. Zero is returned if there are no back references.

PCRE_INFO_CAPTURECOUNT

Return the number of capturing subpatterns in the pattern. The fourth argument should point to an int variable.

PCRE_INFO_DEFAULT_TABLES

Return a pointer to the internal default character tables within PCRE. The fourth argument should point to an unsigned char * variable. This information call is provided for internal use by the pcre_study() function. External callers can cause PCRE to use its internal tables by passing a NULL table pointer.

PCRE_INFO_FIRSTBYTE (deprecated)

Return information about the first data unit of any matched string, for a non-anchored pattern. The name of this option refers to the 8-bit library, where data units are bytes. The fourth argument should point to an int variable. Negative values are used for special cases. However, this means that when the 32-bit library is in non-UTF-32 mode, the full 32-bit range of characters cannot be returned. For this reason, this value is deprecated; use PCRE_INFO_FIRSTCHARACTERFLAGS and PCRE_INFO_FIRSTCHARACTER instead.

If there is a fixed first value, for example, the letter "c" from a pattern such as (cat|cow|coyote), its value is returned. In the 8-bit library, the value is always less than 256. In the 16-bit library the value can be up to 0xffff. In the 32-bit library the value can be up to 0x10ffff.

If there is no fixed first value, and if either

(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch starts with "^", or

(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set (if it were set, the pattern would be anchored),

-1 is returned, indicating that the pattern matches only at the start of a subject string or after any newline within the string. Otherwise -2 is returned. For anchored patterns, -2 is returned.

PCRE_INFO_FIRSTCHARACTER

Return the value of the first data unit (non-UTF character) of any matched string in the situation where PCRE_INFO_FIRSTCHARACTERFLAGS returns 1; otherwise return 0. The fourth argument should point to a uint32_t variable.

In the 8-bit library, the value is always less than 256. In the 16-bit library the value can be up to 0xffff. In the 32-bit library in UTF-32 mode the value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 mode.

PCRE_INFO_FIRSTCHARACTERFLAGS

Return information about the first data unit of any matched string, for a non-anchored pattern. The fourth argument should point to an int variable.

If there is a fixed first value, for example, the letter "c" from a pattern such as (cat|cow|coyote), 1 is returned, and the character value can be retrieved using PCRE_INFO_FIRSTCHARACTER. If there is no fixed first value, and if either

(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch starts with "^", or

(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set (if it were set, the pattern would be anchored),

2 is returned, indicating that the pattern matches only at the start of a subject string or after any newline within the string. Otherwise 0 is returned. For anchored patterns, 0 is returned.

PCRE_INFO_FIRSTTABLE

If the pattern was studied, and this resulted in the construction of a 256-bit table indicating a fixed set of values for the first data unit in any matching string, a pointer to the table is returned. Otherwise NULL is returned. The fourth argument should point to an unsigned char * variable.

PCRE_INFO_HASCRORLF

Return 1 if the pattern contains any explicit matches for CR or LF characters, otherwise 0. The fourth argument should point to an int variable. An explicit match is either a literal CR or LF character, or \r or \n.

PCRE_INFO_JCHANGED

Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise 0. The fourth argument should point to an int variable. (?J) and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.

PCRE_INFO_JIT

Return 1 if the pattern was studied with one of the JIT options, and just-in-time compiling was successful. The fourth argument should point to an int variable. A return value of 0 means that JIT support is not available in this version of PCRE, or that the pattern was not studied with a JIT option, or that the JIT compiler could not handle this particular pattern. See the pcrejit documentation for details of what can and cannot be handled.

PCRE_INFO_JITSIZE

If the pattern was successfully studied with a JIT option, return the size of the JIT compiled code, otherwise return zero. The fourth argument should point to a size_t variable.

PCRE_INFO_LASTLITERAL

Return the value of the rightmost literal data unit that must exist in any matched string, other than at its start, if such a value has been recorded. The fourth argument should point to an int variable. If there is no such value, -1 is returned. For anchored patterns, a last literal value is recorded only if it follows something of variable length. For example, for the pattern /^a\d+z\d+/ the returned value is "z", but for /^a\dz\d/ the returned value is -1.

Since for the 32-bit library using the non-UTF-32 mode, this function is unable to return the full 32-bit range of characters, this value is deprecated; instead the PCRE_INFO_REQUIREDCHARFLAGS and PCRE_INFO_REQUIREDCHAR values should be used.

PCRE_INFO_MATCH_EMPTY

Return 1 if the pattern can match an empty string, otherwise 0. The fourth argument should point to an int variable.

PCRE_INFO_MATCHLIMIT

If the pattern set a match limit by including an item of the form (*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth argument should point to an unsigned 32-bit integer. If no such value has been set, the call to pcre_fullinfo() returns the error PCRE_ERROR_UNSET.

PCRE_INFO_MAXLOOKBEHIND

Return the number of characters (NB not data units) in the longest lookbehind assertion in the pattern. This information is useful when doing multi-segment matching using the partial matching facilities. Note that the simple assertions \b and \B require a one-character lookbehind. \A also registers a one-character lookbehind, though it does not actually inspect the previous character. This is to ensure that at least one character from the old segment is retained when a new segment is processed. Otherwise, if there are no lookbehinds in the pattern, \A might match incorrectly at the start of a new segment.

PCRE_INFO_MINLENGTH

If the pattern was studied and a minimum length for matching subject strings was computed, its value is returned. Otherwise the returned value is -1. The value is a number of characters, which in UTF mode may be different from the number of data units. The fourth argument should point to an int variable. A non-negative value is a lower bound to the length of any matching string. There may not be any strings of that length that do actually match, but every string that does match is at least that long.

PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE

PCRE supports the use of named as well as numbered capturing parentheses. The names are just an additional way of identifying the parentheses, which still acquire numbers. Several convenience functions such as pcre_get_named_substring() are provided for extracting captured substrings by name. It is also possible to extract the data directly, by first converting the name to a number in order to access the correct pointers in the output vector (described with pcre_exec() below). To do the conversion, you need to use the name-to-number map, which is described by these three values.

The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size of each entry; both of these return an int value. The entry size depends on the length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first entry of the table. This is a pointer to char in the 8-bit library, where the first two bytes of each entry are the number of the capturing parenthesis, most significant byte first. In the 16-bit library, the pointer points to 16-bit data units, the first of which contains the parenthesis number. In the 32-bit library, the pointer points to 32-bit data units, the first of which contains the parenthesis number. The rest of the entry is the corresponding name, zero terminated.

The names are in alphabetical order. If (?| is used to create multiple groups with the same number, as described in the section on duplicate subpattern numbers in the pcrepattern page, the groups may be given the same name, but there is only one entry in the table. Different names for groups of the same number are not permitted. Duplicate names for subpatterns with different numbers are permitted, but only if PCRE_DUPNAMES is set. They appear in the table in the order in which they were found in the pattern. In the absence of (?| this is the order of increasing number; when (?| is used this is not necessarily the case because later subpatterns may have lower numbers.

As a simple example of the name/number table, consider the following pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is set, so white space - including newlines - is ignored):

(?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )

There are four named subpatterns, so the table has four entries, and each entry in the table is eight bytes long. The table is as follows, with non-printing bytes shown in hexadecimal, and undefined bytes shown as ??:

  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??

When writing code to extract data from named subpatterns using the name-to-number map, remember that the length of the entries is likely to be different for each compiled pattern.
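The table layout described above can be walked with a few lines of C. This sketch covers the 8-bit library layout only and needs no PCRE headers, because it operates on a copy of the example table in which the undefined (??) bytes are written as zero. name_to_number() and date_table are hypothetical names; in real code the table pointer, entry count and entry size would come from PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: look up a group number by name in an 8-bit name
   table. Each entry holds the group number in its first two bytes, most
   significant byte first, followed by the zero-terminated name.
   Returns -1 if the name is not in the table. */
static int name_to_number(const unsigned char *table, int count,
                          int entry_size, const char *name)
{
    int i;

    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The example table from the text: four entries of eight bytes each.
   Bytes shown as ?? in the text are arbitrary; zero is used here. */
static const unsigned char date_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};
```

Calling name_to_number(date_table, 4, 8, "month") yields 4. The entry size is passed in rather than hard-coded because it differs from pattern to pattern. When PCRE_DUPNAMES is in use, this linear scan returns only the first entry for a duplicated name.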

PCRE_INFO_OKPARTIAL

Return 1 if the pattern can be used for partial matching with pcre_exec(), otherwise 0. The fourth argument should point to an int variable. From release 8.00, this always returns 1, because the restrictions that previously applied to partial matching have been lifted. The pcrepartial documentation gives details of partial matching.

PCRE_INFO_OPTIONS

Return a copy of the options with which the pattern was compiled. The fourth argument should point to an unsigned long int variable. These option bits are those specified in the call to pcre_compile(), modified by any top-level option settings at the start of the pattern itself. In other words, they are the options that will be in force when matching starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE, and PCRE_EXTENDED.

A pattern is automatically anchored by PCRE if all of its top-level alternatives begin with one of the following:

  ^     unless PCRE_MULTILINE is set
  \A    always
  \G    always
  .*    if PCRE_DOTALL is set and there are no back
        references to the subpattern in which .* appears

For such patterns, the PCRE_ANCHORED bit is set in the options returned by pcre_fullinfo().

PCRE_INFO_RECURSIONLIMIT

If the pattern set a recursion limit by including an item of the form (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth argument should point to an unsigned 32-bit integer. If no such value has been set, the call to pcre_fullinfo() returns the error PCRE_ERROR_UNSET.

PCRE_INFO_SIZE

Return the size of the compiled pattern in bytes (for all three libraries). The fourth argument should point to a size_t variable. This value does not include the size of the pcre structure that is returned by pcre_compile(). The value that is passed as the argument to pcre_malloc() when pcre_compile() is getting memory in which to place the compiled data is the value returned by this option plus the size of the pcre structure. Studying a compiled pattern, with or without JIT, does not alter the value returned by this option.

PCRE_INFO_STUDYSIZE

Return the size in bytes (for all three libraries) of the data block pointed to by the study_data field in a pcre_extra block. If the pcre_extra pointer passed as the second argument is NULL, or there is no study data, zero is returned. The fourth argument should point to a size_t variable. The study_data field is set by pcre_study() to record information that will speed up matching (see the section entitled "Studying a pattern" above). The format of the study_data block is private, but its length is made available via this option so that it can be saved and restored (see the pcreprecompile documentation for details).

PCRE_INFO_REQUIREDCHARFLAGS

Returns 1 if there is a rightmost literal data unit that must exist in any matched string, other than at its start. The fourth argument should point to an int variable. If there is no such value, 0 is returned. If returning 1, the character value itself can be retrieved using PCRE_INFO_REQUIREDCHAR.

For anchored patterns, a last literal value is recorded only if it follows something of variable length. For example, for the pattern /^a\d+z\d+/ the returned value is 1 (with "z" returned from PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.

PCRE_INFO_REQUIREDCHAR

Return the value of the rightmost literal data unit that must exist in any matched string, other than at its start, if such a value has been recorded. The fourth argument should point to a uint32_t variable. If there is no such value, 0 is returned.