pcretest(1)

pcretest - a program for testing Perl-compatible regular expressions.

PATTERN MODIFIERS

A pattern may be followed by any number of modifiers, which are mostly single characters, though some of these can be qualified by further characters. Following Perl usage, these are referred to below as, for example, "the /i modifier", even though the delimiter of the pattern need not always be a slash, and no slash is used when writing modifiers. White space may appear between the final pattern delimiter and the first modifier, and between the modifiers themselves. For reference, here is a complete list of modifiers. They fall into several groups that are described in detail in the following sections.

  /8              set UTF mode
  /9              set PCRE_NEVER_UTF (locks out UTF mode)
  /?              disable UTF validity check
  /+              show remainder of subject after match
  /=              show all captures (not just those that are set)

  /A              set PCRE_ANCHORED
  /B              show compiled code
  /C              set PCRE_AUTO_CALLOUT
  /D              same as /B plus /I
  /E              set PCRE_DOLLAR_ENDONLY
  /F              flip byte order in compiled pattern
  /f              set PCRE_FIRSTLINE
  /G              find all matches (shorten string)
  /g              find all matches (use startoffset)
  /I              show information about pattern
  /i              set PCRE_CASELESS
  /J              set PCRE_DUPNAMES
  /K              show backtracking control names
  /L              set locale
  /M              show compiled memory size
  /m              set PCRE_MULTILINE
  /N              set PCRE_NO_AUTO_CAPTURE
  /O              set PCRE_NO_AUTO_POSSESS
  /P              use the POSIX wrapper
  /Q              test external stack check function
  /S              study the pattern after compilation
  /s              set PCRE_DOTALL
  /T              select character tables
  /U              set PCRE_UNGREEDY
  /W              set PCRE_UCP
  /X              set PCRE_EXTRA
  /x              set PCRE_EXTENDED
  /Y              set PCRE_NO_START_OPTIMIZE
  /Z              don't show lengths in /B output

  /<any>          set PCRE_NEWLINE_ANY
  /<anycrlf>      set PCRE_NEWLINE_ANYCRLF
  /<cr>           set PCRE_NEWLINE_CR
  /<crlf>         set PCRE_NEWLINE_CRLF
  /<lf>           set PCRE_NEWLINE_LF
  /<bsr_anycrlf>  set PCRE_BSR_ANYCRLF
  /<bsr_unicode>  set PCRE_BSR_UNICODE
  /<JS>           set PCRE_JAVASCRIPT_COMPAT

Perl-compatible modifiers

The /i, /m, /s, and /x modifiers set the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre[16|32]_compile() is called. These four modifier letters have the same effect as they do in Perl. For example:

/caseless/i

Modifiers for other PCRE options

The following table shows additional modifiers for setting PCRE compile-time options that do not correspond to anything in Perl:

  /8              PCRE_UTF8           ) when using the 8-bit
  /?              PCRE_NO_UTF8_CHECK  )   library

  /8              PCRE_UTF16          ) when using the 16-bit
  /?              PCRE_NO_UTF16_CHECK )   library

  /8              PCRE_UTF32          ) when using the 32-bit
  /?              PCRE_NO_UTF32_CHECK )   library

  /9              PCRE_NEVER_UTF
  /A              PCRE_ANCHORED
  /C              PCRE_AUTO_CALLOUT
  /E              PCRE_DOLLAR_ENDONLY
  /f              PCRE_FIRSTLINE
  /J              PCRE_DUPNAMES
  /N              PCRE_NO_AUTO_CAPTURE
  /O              PCRE_NO_AUTO_POSSESS
  /U              PCRE_UNGREEDY
  /W              PCRE_UCP
  /X              PCRE_EXTRA
  /Y              PCRE_NO_START_OPTIMIZE
  /<any>          PCRE_NEWLINE_ANY
  /<anycrlf>      PCRE_NEWLINE_ANYCRLF
  /<cr>           PCRE_NEWLINE_CR
  /<crlf>         PCRE_NEWLINE_CRLF
  /<lf>           PCRE_NEWLINE_LF
  /<bsr_anycrlf>  PCRE_BSR_ANYCRLF
  /<bsr_unicode>  PCRE_BSR_UNICODE
  /<JS>           PCRE_JAVASCRIPT_COMPAT

The modifiers that are enclosed in angle brackets are literal strings as shown, including the angle brackets, but the letters within can be in either case. This example sets multiline matching with CRLF as the line ending sequence:

/^abc/m<CRLF>
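The modifier syntax described so far (single characters, bracketed names in either case, optional white space between modifiers) can be illustrated with a small tokenizer. This is only a sketch in Python, not pcretest's own parser: the name split_modifiers is hypothetical, and it deliberately ignores modifiers that take an argument, such as /L, /Q, /T, and the qualifiers of /S.

```python
import re

# Hypothetical tokenizer, not pcretest's own parser: a token is either a
# bracketed name such as <CRLF> or any single non-space character.
_TOKEN = re.compile(r"<[A-Za-z_]+>|\S")

def split_modifiers(text):
    """Split the text that follows the closing delimiter into modifier tokens."""
    tokens = []
    for tok in _TOKEN.findall(text):
        if len(tok) > 1:
            # Bracketed names may be written in either case, so fold them.
            tok = tok.lower()
        tokens.append(tok)  # single characters stay case-sensitive (/i is not /I)
    return tokens

print(split_modifiers("m<CRLF>"))
print(split_modifiers("i <AnyCRLF> I"))
```

The first call yields the tokens m and <crlf>; the second shows that white space is skipped and that only the bracketed name is case-folded.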

As well as turning on the PCRE_UTF8/16/32 option, the /8 modifier causes all non-printing characters in output strings to be printed using the \x{hh...} notation. Otherwise, those less than 0x100 are output in hex without the curly brackets.

Full details of the PCRE options are given in the pcreapi documentation.

Finding all matches in a string

Searching for all possible matches within each subject string can be requested by the /g or /G modifier. After finding a match, PCRE is called again to search the remainder of the subject string. The difference between /g and /G is that the former uses the startoffset argument to pcre[16|32]_exec() to start searching at a new point within the entire string (which is in effect what Perl does), whereas the latter passes over a shortened substring. This makes a difference to the matching process if the pattern begins with a lookbehind assertion (including \b or \B).
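The difference is easiest to see with a pattern that starts with \b. The sketch below uses Python's re module purely as a stand-in for PCRE: its search(subject, pos) call keeps the text before the start offset visible to \b and lookbehinds, much as the startoffset argument does, while slicing the subject hides it. The function names are hypothetical, and empty matches are handled by simply stepping one character, which is cruder than what pcretest does.

```python
import re

def find_all_startoffset(pattern, subject):
    """/g-like: search the whole subject again from a moving start offset."""
    rx = re.compile(pattern)
    spans, pos = [], 0
    while pos <= len(subject):
        m = rx.search(subject, pos)
        if m is None:
            break
        spans.append(m.span())
        # Simplification: after an empty match just step one character.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return spans

def find_all_shortened(pattern, subject):
    """/G-like: search a shortened copy that drops what was already consumed."""
    rx = re.compile(pattern)
    spans, base = [], 0
    while base <= len(subject):
        m = rx.search(subject[base:])
        if m is None:
            break
        spans.append((base + m.start(), base + m.end()))
        base += m.end() if m.end() > m.start() else m.end() + 1
    return spans

print(find_all_startoffset(r"\bfoo", "foofoo"))  # second "foo" follows a word character
print(find_all_shortened(r"\bfoo", "foofoo"))    # second "foo" starts the shortened string
```

With the subject "foofoo", the start-offset variant reports one match, because the second "foo" is preceded by a word character; the shortened-string variant reports two, because the second "foo" now appears to start the string.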

If any call to pcre[16|32]_exec() in a /g or /G sequence matches an empty string, the next call is done with the PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED flags set in order to search for another, non-empty, match at the same point. If this second match fails, the start offset is advanced, and the normal match is retried. This imitates the way Perl handles such cases when using the /g modifier or the split() function. Normally, the start offset is advanced by one character, but if the newline convention recognizes CRLF as a newline, and the current character is CR followed by LF, an advance of two is used.
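The advance rule in the last sentence can be written down in a few lines. This Python sketch covers only that step, not the retry with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED; the function name and the lowercase convention names are hypothetical, and the set of conventions that recognize CRLF is taken to be crlf, anycrlf, and any.

```python
# Newline conventions assumed to recognize CRLF as a newline.
CRLF_AWARE = {"crlf", "anycrlf", "any"}

def next_start_offset(subject, offset, newline="lf"):
    """Return the start offset to use after the non-empty retry has failed."""
    if newline in CRLF_AWARE and subject.startswith("\r\n", offset):
        return offset + 2  # step over CR and LF together
    return offset + 1      # otherwise step over one character

print(next_start_offset("a\r\nb", 1, "crlf"))
print(next_start_offset("a\r\nb", 1, "lf"))
```

Here offsets count characters of a Python string; in PCRE a character may occupy more than one data item in UTF mode, which this sketch does not model.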

Other modifiers

There are yet more modifiers for controlling the way pcretest operates.

The /+ modifier requests that as well as outputting the substring that matched the entire pattern, pcretest should in addition output the remainder of the subject string. This is useful for tests where the subject contains multiple copies of the same substring. If the + modifier appears twice, the same action is taken for captured substrings. In each case the remainder is output on the following line with a plus character following the capture number. Note that this modifier must not immediately follow the /S modifier because /S+ and /S++ have other meanings.

The /= modifier requests that the values of all potential captured parentheses be output after a match. By default, only those up to the highest one actually used in the match are output (corresponding to the return code from pcre[16|32]_exec()). Values in the offsets vector corresponding to higher numbers should be set to -1, and these are output as "<unset>". This modifier gives a way of checking that this is happening.
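As a rough analogy, the following Python sketch lists captures either up to the highest group that was set or for every group in the pattern. It uses Python's re module, where an unset group has start offset -1, rather than PCRE itself; the function name show_captures and the exact line layout are assumptions made for illustration.

```python
import re

def show_captures(m, show_all=False):
    """List captures; show_all mimics the effect of /= on what gets printed."""
    total = m.re.groups  # number of capturing groups in the pattern
    highest = max(i for i in range(total + 1) if m.start(i) != -1)
    limit = total if show_all else highest
    lines = []
    for i in range(limit + 1):
        text = "<unset>" if m.start(i) == -1 else m.group(i)
        lines.append(f"{i:2d}: {text}")
    return lines

m = re.search(r"(a)|(b)|(c)", "b")
print("\n".join(show_captures(m)))
print("\n".join(show_captures(m, show_all=True)))
```

For this pattern and the subject "b", the default listing stops at group 2, while the full listing also shows group 3 as unset.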

The /B modifier is a debugging feature. It requests that pcretest output a representation of the compiled code after compilation. Normally this information contains length and offset values; however, if /Z is also present, this data is replaced by spaces. This is a special feature for use in the automatic test scripts; it ensures that the same output is generated for different internal link sizes.

The /D modifier is a PCRE debugging feature, and is equivalent to /BI, that is, both the /B and the /I modifiers.

The /F modifier causes pcretest to flip the byte order of the 2-byte and 4-byte fields in the compiled pattern. This facility is for testing the feature in PCRE that allows it to execute patterns that were compiled on a host with a different endianness. This feature is not available when the POSIX interface to PCRE is being used, that is, when the /P pattern modifier is specified. See also the section about saving and reloading compiled patterns below.

The /I modifier requests that pcretest output information about the compiled pattern (whether it is anchored, has a fixed first character, and so on). It does this by calling pcre[16|32]_fullinfo() after compiling a pattern. If the pattern is studied, the results of that are also output. In this output, the word "char" means a non-UTF character, that is, the value of a single data item (8-bit, 16-bit, or 32-bit, depending on the library that is being tested).

The /K modifier requests pcretest to show names from backtracking control verbs that are returned from calls to pcre[16|32]_exec(). It causes pcretest to create a pcre[16|32]_extra block if one has not already been created by a call to pcre[16|32]_study(), and to set the PCRE_EXTRA_MARK flag and the mark field within it, every time that pcre[16|32]_exec() is called. If the variable that the mark field points to is non-NULL for a match, non-match, or partial match, pcretest prints the string to which it points. For a match, this is shown on a line by itself, tagged with "MK:". For a non-match it is added to the message.

The /L modifier must be followed directly by the name of a locale, for example,

/pattern/Lfr_FR

For this reason, it must be the last modifier. The given locale is set, pcre[16|32]_maketables() is called to build a set of character tables for the locale, and this is then passed to pcre[16|32]_compile() when compiling the regular expression. Without an /L (or /T) modifier, NULL is passed as the tables pointer; that is, /L applies only to the expression on which it appears.

The /M modifier causes the size in bytes of the memory block used to hold the compiled pattern to be output. This does not include the size of the pcre[16|32] block; it is just the actual compiled data. If the pattern is successfully studied with the PCRE_STUDY_JIT_COMPILE option, the size of the JIT compiled code is also output.

The /Q modifier is used to test the use of pcre_stack_guard. It must be followed by '0' or '1', specifying the return code to be given from an external function that is passed to PCRE and used for stack checking during compilation (see the pcreapi documentation for details).

The /S modifier causes pcre[16|32]_study() to be called after the expression has been compiled, and the results used when the expression is matched. There are a number of qualifying characters that may follow /S. They may appear in any order.

If /S is followed by an exclamation mark, pcre[16|32]_study() is called with the PCRE_STUDY_EXTRA_NEEDED option, causing it always to return a pcre_extra block, even when studying discovers no useful information.

If /S is followed by a second S character, it suppresses studying, even if it was requested externally by the -s command line option. This makes it possible to specify that certain patterns are always studied, and others are never studied, independently of -s. This feature is used in the test files in a few cases where the output is different when the pattern is studied.

If the /S modifier is followed by a + character, the call to pcre[16|32]_study() is made with all the JIT study options, requesting just-in-time optimization support if it is available, for both normal and partial matching. If you want to restrict the JIT compiling modes, you can follow /S+ with a digit in the range 1 to 7:

  1  normal match only
  2  soft partial match only
  3  normal match and soft partial match
  4  hard partial match only
  6  soft and hard partial match
  7  all three modes (default)
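The digits in this table are consistent with a three-bit mask (1 for normal, 2 for soft partial, 4 for hard partial). The Python sketch below decodes a digit on that assumption; the bit-mask reading is inferred from the table rather than stated in it, and the names used are hypothetical.

```python
# Assumed bit values, inferred from the table above.
JIT_MODE_BITS = {
    1: "normal match",
    2: "soft partial match",
    4: "hard partial match",
}

def jit_modes(digit=7):
    """Decode the digit that may follow /S+ into a list of JIT modes."""
    if not 1 <= digit <= 7:
        raise ValueError("digit must be in the range 1 to 7")
    return [name for bit, name in JIT_MODE_BITS.items() if digit & bit]

print(jit_modes(3))
print(jit_modes(6))
print(jit_modes())
```

Under the same reading, 5 would select normal and hard partial matching, although the table does not list it.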

If /S++ is used instead of /S+ (with or without a following digit), the text "(JIT)" is added to the first output line after a match or no match when JIT-compiled code was actually used.

Note that there is also an independent /+ modifier; it must not be given immediately after /S or /S+ because this will be misinterpreted.

If JIT studying is successful, the compiled JIT code will automatically be used when pcre[16|32]_exec() is run, except when incompatible run-time options are specified. For more details, see the pcrejit documentation. See also the \J escape sequence below for a way of setting the size of the JIT stack.

Finally, if /S is followed by a minus character, JIT compilation is suppressed, even if it was requested externally by the -s command line option. This makes it possible to specify that JIT is never to be used for certain patterns.

The /T modifier must be followed by a single digit. It causes a specific set of built-in character tables to be passed to pcre[16|32]_compile(). It is used in the standard PCRE tests to check behaviour with different character tables. The digit specifies the tables as follows:

  0   the default ASCII tables, as distributed in pcre_chartables.c.dist
  1   a set of tables defining ISO 8859 characters

In table 1, some characters whose codes are greater than 128 are identified as letters, digits, spaces, etc.

Using the POSIX wrapper API

The /P modifier causes pcretest to call PCRE via the POSIX wrapper API rather than its native API. This supports only the 8-bit library. When /P is set, the following modifiers set options for the regcomp() function:

  /i    REG_ICASE
  /m    REG_NEWLINE
  /N    REG_NOSUB
  /s    REG_DOTALL     )
  /U    REG_UNGREEDY   ) These options are not part of
  /W    REG_UCP        )   the POSIX standard
  /8    REG_UTF8       )

The /+ modifier works as described above. All other modifiers are ignored.
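The mapping can be summarized as a lookup. The following Python sketch only names the regcomp() options that a modifier string would select under /P and collects the modifiers that would be ignored; it does not call regcomp(), and the function name is hypothetical.

```python
# Modifier-to-option mapping taken from the table above.
POSIX_FLAGS = {
    "i": "REG_ICASE",
    "m": "REG_NEWLINE",
    "N": "REG_NOSUB",
    "s": "REG_DOTALL",
    "U": "REG_UNGREEDY",
    "W": "REG_UCP",
    "8": "REG_UTF8",
}

def regcomp_flags(modifiers):
    """Return (option names, ignored modifiers) for a pattern that uses /P."""
    flags, ignored = [], []
    for mod in modifiers:
        if mod in ("P", "+"):
            continue  # /P selects the wrapper; /+ affects output, not regcomp()
        if mod in POSIX_FLAGS:
            flags.append(POSIX_FLAGS[mod])
        else:
            ignored.append(mod)
    return flags, ignored

print(regcomp_flags("Pix8+"))
```

In this example /i and /8 map to REG_ICASE and REG_UTF8, while /x has no POSIX counterpart and ends up in the ignored list.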

Locking out certain modifiers

PCRE can be compiled with or without support for certain features such as UTF-8/16/32 or Unicode properties. Accordingly, the standard tests are split up into a number of different files that are selected for running depending on which features are available. When updating the tests, it is all too easy to put a new test into the wrong file by mistake; for example, to put a test that requires UTF support into a file that is used when it is not available. To help detect such mistakes as early as possible, there is a facility for locking out specific modifiers. If an input line for pcretest starts with the string "< forbid " the following sequence of characters is taken as a list of forbidden modifiers. For example, in the test files that must not use UTF or Unicode property support, this line appears:

< forbid 8W

This locks out the /8 and /W modifiers. An immediate error is given if they are subsequently encountered. If the character string contains < but not >, all the multi-character modifiers that begin with < are locked out. Otherwise, such modifiers must be explicitly listed, for example:

< forbid <JS><cr>

There must be a single space between < and "forbid" for this feature to be recognised. If there is not, the line is interpreted either as a request to re-load a pre-compiled pattern (see "SAVING AND RELOADING COMPILED PATTERNS" below) or, if there is another < character, as a pattern that uses < as its delimiter.
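The rules for lines that begin with < can be sketched as follows. This Python illustration is an assumption-laden reading of the text above, not pcretest's code: the function names are hypothetical, and folding bracketed names to lower case in the forbid list is assumed from the rule that such names may be written in either case.

```python
import re

def classify_line(line):
    """Decide how an input line that may start with '<' would be treated."""
    if line.startswith("< forbid "):
        return "forbid"
    if line.startswith("<"):
        # A second '<' means '<' is being used as the pattern delimiter.
        return "pattern" if "<" in line[1:] else "reload"
    return "other"

def parse_forbid(line):
    """Return (single-character modifiers, bracketed modifiers, all_bracketed)."""
    spec = line[len("< forbid "):].replace(" ", "")
    if "<" in spec and ">" not in spec:
        # A '<' without '>' locks out every modifier that begins with '<'.
        return set(spec.replace("<", "")), set(), True
    bracketed = {name.lower() for name in re.findall(r"<[^<>]+>", spec)}
    singles = set(re.sub(r"<[^<>]+>", "", spec))
    return singles, bracketed, False

print(classify_line("< forbid 8W"), parse_forbid("< forbid 8W"))
print(classify_line("<forbid 8W"))
print(parse_forbid("< forbid <JS><cr>"))
```

Note that the second call, which lacks the space after <, is classified as a reload request rather than a forbid line.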