pcretest - a program for testing Perl-compatible regular expressions.
PATTERN MODIFIERS
A pattern may be followed by any number of modifiers, which are
mostly single characters, though some of these can be qualified
by further characters. Following Perl usage, these are referred
to below as, for example, "the /i modifier", even though the
delimiter of the pattern need not always be a slash, and no slash
is used when writing modifiers. White space may appear between
the final pattern delimiter and the first modifier, and between
the modifiers themselves. For reference, here is a complete list
of modifiers. They fall into several groups that are described in
detail in the following sections.
  /8              set UTF mode
  /9              set PCRE_NEVER_UTF (locks out UTF mode)
  /?              disable UTF validity check
  /+              show remainder of subject after match
  /=              show all captures (not just those that are set)
  /A              set PCRE_ANCHORED
  /B              show compiled code
  /C              set PCRE_AUTO_CALLOUT
  /D              same as /B plus /I
  /E              set PCRE_DOLLAR_ENDONLY
  /F              flip byte order in compiled pattern
  /f              set PCRE_FIRSTLINE
  /G              find all matches (shorten string)
  /g              find all matches (use startoffset)
  /I              show information about pattern
  /i              set PCRE_CASELESS
  /J              set PCRE_DUPNAMES
  /K              show backtracking control names
  /L              set locale
  /M              show compiled memory size
  /m              set PCRE_MULTILINE
  /N              set PCRE_NO_AUTO_CAPTURE
  /O              set PCRE_NO_AUTO_POSSESS
  /P              use the POSIX wrapper
  /Q              test external stack check function
  /S              study the pattern after compilation
  /s              set PCRE_DOTALL
  /T              select character tables
  /U              set PCRE_UNGREEDY
  /W              set PCRE_UCP
  /X              set PCRE_EXTRA
  /x              set PCRE_EXTENDED
  /Y              set PCRE_NO_START_OPTIMIZE
  /Z              don't show lengths in /B output
  /<any>          set PCRE_NEWLINE_ANY
  /<anycrlf>      set PCRE_NEWLINE_ANYCRLF
  /<cr>           set PCRE_NEWLINE_CR
  /<crlf>         set PCRE_NEWLINE_CRLF
  /<lf>           set PCRE_NEWLINE_LF
  /<bsr_anycrlf>  set PCRE_BSR_ANYCRLF
  /<bsr_unicode>  set PCRE_BSR_UNICODE
  /<JS>           set PCRE_JAVASCRIPT_COMPAT
Perl-compatible modifiers
The /i, /m, /s, and /x modifiers set the PCRE_CASELESS,
PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options,
respectively, when pcre[16|32]_compile() is called. These four
modifier letters have the same effect as they do in Perl. For
example:
  /caseless/i
Modifiers for other PCRE options
The following table shows additional modifiers for setting PCRE
compile-time options that do not correspond to anything in Perl:
  /8              PCRE_UTF8            ) when using the 8-bit
  /?              PCRE_NO_UTF8_CHECK   )   library
  /8              PCRE_UTF16           ) when using the 16-bit
  /?              PCRE_NO_UTF16_CHECK  )   library
  /8              PCRE_UTF32           ) when using the 32-bit
  /?              PCRE_NO_UTF32_CHECK  )   library
  /9              PCRE_NEVER_UTF
  /A              PCRE_ANCHORED
  /C              PCRE_AUTO_CALLOUT
  /E              PCRE_DOLLAR_ENDONLY
  /f              PCRE_FIRSTLINE
  /J              PCRE_DUPNAMES
  /N              PCRE_NO_AUTO_CAPTURE
  /O              PCRE_NO_AUTO_POSSESS
  /U              PCRE_UNGREEDY
  /W              PCRE_UCP
  /X              PCRE_EXTRA
  /Y              PCRE_NO_START_OPTIMIZE
  /<any>          PCRE_NEWLINE_ANY
  /<anycrlf>      PCRE_NEWLINE_ANYCRLF
  /<cr>           PCRE_NEWLINE_CR
  /<crlf>         PCRE_NEWLINE_CRLF
  /<lf>           PCRE_NEWLINE_LF
  /<bsr_anycrlf>  PCRE_BSR_ANYCRLF
  /<bsr_unicode>  PCRE_BSR_UNICODE
  /<JS>           PCRE_JAVASCRIPT_COMPAT
The modifiers that are enclosed in angle brackets are literal
strings as shown, including the angle brackets, but the letters
within can be in either case. This example sets multiline
matching with CRLF as the line ending sequence:
/^abc/m<CRLF>
As well as turning on the PCRE_UTF8/16/32 option, the /8 modifier
causes all non-printing characters in output strings to be
printed using the \x{hh...} notation. Otherwise, those less than
0x100 are output in hex without the curly brackets.
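
The output rule just described can be sketched in Python. This is an illustration only, not pcretest's own code: the function names are hypothetical, and treating "printing" as the ASCII range 0x20 to 0x7e is an assumption.

```python
def format_code_unit(value: int, utf_mode: bool) -> str:
    """Render one character value following the rule described above."""
    # Assumption: "printing" characters are ASCII 0x20..0x7e, shown as-is.
    if 0x20 <= value <= 0x7E:
        return chr(value)
    # Without /8, values below 0x100 use plain hex without curly brackets.
    if value < 0x100 and not utf_mode:
        return f"\\x{value:02x}"
    # With /8, or for larger values, the braced \x{hh...} form is used.
    return f"\\x{{{value:02x}}}"


def format_subject(values, utf_mode: bool) -> str:
    """Render a sequence of character values as one output string."""
    return "".join(format_code_unit(v, utf_mode) for v in values)
```

For instance, the values 0x61, 0x0a, 0xe9 would be shown as a\x0a\xe9 without /8 and as a\x{0a}\x{e9} with it.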
Full details of the PCRE options are given in the pcreapi
documentation.
Finding all matches in a string
Searching for all possible matches within each subject string can
be requested by the /g or /G modifier. After finding a match,
PCRE is called again to search the remainder of the subject
string. The difference between /g and /G is that the former uses
the startoffset argument to pcre[16|32]_exec() to start searching
at a new point within the entire string (which is in effect what
Perl does), whereas the latter passes over a shortened substring.
This makes a difference to the matching process if the pattern
begins with a lookbehind assertion (including \b or \B).
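
The difference between the two strategies can be illustrated with Python's re module, used here purely as a stand-in for PCRE. The function names are hypothetical. The pos argument of search() plays the role of startoffset, while slicing the string plays the role of the shortened substring used by /G. Empty matches are handled by simply stepping forward one character, which is cruder than what pcretest does.

```python
import re


def find_all_with_offset(pattern: str, subject: str) -> list:
    """/g-like: search the whole subject from a moving start offset."""
    regex = re.compile(pattern)
    results, pos = [], 0
    while pos <= len(subject):
        m = regex.search(subject, pos)
        if m is None:
            break
        results.append(m.group(0))
        # Step past the match; move one extra character after an empty match.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return results


def find_all_shortened(pattern: str, subject: str) -> list:
    """/G-like: search a shortened copy that starts after the last match."""
    regex = re.compile(pattern)
    results, rest = [], subject
    while True:
        m = regex.search(rest)
        if m is None:
            break
        results.append(m.group(0))
        cut = m.end() if m.end() > m.start() else m.end() + 1
        if cut > len(rest):
            break
        # The text before the cut is no longer visible to \b or lookbehinds.
        rest = rest[cut:]
    return results
```

With the pattern \b\w and the subject "ab", the offset version finds only "a", because \b can still see the "a" that precedes "b". The shortened version also finds "b", because the second search sees "b" at the very start of its string.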
If any call to pcre[16|32]_exec() in a /g or /G sequence matches
an empty string, the next call is done with the
PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED flags set in order to
search for another, non-empty, match at the same point. If this
second match fails, the start offset is advanced, and the normal
match is retried. This imitates the way Perl handles such cases
when using the /g modifier or the split() function. Normally, the
start offset is advanced by one character, but if the newline
convention recognizes CRLF as a newline, and the current
character is CR followed by LF, an advance of two is used.
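
The advance rule at the end of this paragraph can be written as a small helper. This is a sketch with a hypothetical function name; it covers only the offset calculation after the anchored non-empty retry has failed, not the retry itself.

```python
def next_start_offset(subject: str, offset: int, crlf_is_newline: bool) -> int:
    """Return the start offset to use after a failed non-empty retry."""
    # Skip a CR LF pair as a unit when the newline convention recognizes CRLF.
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2
    # Otherwise advance by a single character.
    return offset + 1
```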
Other modifiers
There are yet more modifiers for controlling the way pcretest
operates.
The /+ modifier requests that as well as outputting the substring
that matched the entire pattern, pcretest should in addition
output the remainder of the subject string. This is useful for
tests where the subject contains multiple copies of the same
substring. If the + modifier appears twice, the same action is
taken for captured substrings. In each case the remainder is
output on the following line with a plus character following the
capture number. Note that this modifier must not immediately
follow the /S modifier because /S+ and /S++ have other meanings.
The /= modifier requests that the values of all potential
captured parentheses be output after a match. By default, only
those up to the highest one actually used in the match are output
(corresponding to the return code from pcre[16|32]_exec()).
Values in the offsets vector corresponding to higher numbers
should be set to -1, and these are output as "<unset>". This
modifier gives a way of checking that this is happening.
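
The effect of /= can be mimicked with Python's re module, again only as a stand-in for PCRE. The function name and the exact line layout are assumptions. Python reports an unset group with a span of (-1, -1), which corresponds to the -1 entries in the offsets vector.

```python
import re


def format_captures(match, show_all: bool = False) -> list:
    """List captured substrings; show_all=True imitates the /= modifier."""
    total = match.re.groups
    # Highest-numbered group that actually took part in the match.
    highest = max(i for i in range(total + 1) if match.span(i) != (-1, -1))
    last = total if show_all else highest
    lines = []
    for i in range(last + 1):
        # A span of (-1, -1) marks a group that did not participate.
        text = "<unset>" if match.span(i) == (-1, -1) else match.group(i)
        lines.append(f"{i:2d}: {text}")
    return lines
```

For the pattern (a)|(b)|(c) matched against "b", the default listing stops at group 2, while show_all=True adds a final "<unset>" line for group 3.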
The /B modifier is a debugging feature. It requests that pcretest
output a representation of the compiled code after compilation.
Normally this information contains length and offset values;
however, if /Z is also present, this data is replaced by spaces.
This is a special feature for use in the automatic test scripts;
it ensures that the same output is generated for different
internal link sizes.
The /D modifier is a PCRE debugging feature, and is equivalent to
/BI, that is, both the /B and the /I modifiers.
The /F modifier causes pcretest to flip the byte order of the
2-byte and 4-byte fields in the compiled pattern. This facility
is for testing the feature in PCRE that allows it to execute
patterns that were compiled on a host with a different
endianness. This feature is not available when the POSIX
interface to PCRE is being used, that is, when the /P pattern
modifier is specified. See also the section about saving and
reloading compiled patterns below.
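
As a generic illustration of what flipping a field means, the following sketch reverses the bytes of a 2-byte or 4-byte integer. It does not model the layout of a compiled PCRE pattern, and the function name is hypothetical.

```python
def flip_field(value: int, width: int) -> int:
    """Reverse the byte order of an unsigned field of the given width."""
    # Serialize in one byte order and reinterpret in the other.
    return int.from_bytes(value.to_bytes(width, "little"), "big")
```

Flipping twice returns the original value, which is why a pattern flipped for testing can still be interpreted by a library that detects the foreign byte order.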
The /I modifier requests that pcretest output information about
the compiled pattern (whether it is anchored, has a fixed first
character, and so on). It does this by calling
pcre[16|32]_fullinfo() after compiling a pattern. If the pattern
is studied, the results of that are also output. In this output,
the word "char" means a non-UTF character, that is, the value of
a single data item (8-bit, 16-bit, or 32-bit, depending on the
library that is being tested).
The /K modifier requests pcretest to show names from backtracking
control verbs that are returned from calls to
pcre[16|32]_exec(). It causes pcretest to create a
pcre[16|32]_extra block if one has not already been created by a
call to pcre[16|32]_study(), and to set the PCRE_EXTRA_MARK flag
and the mark field within it, every time that pcre[16|32]_exec()
is called. If the variable that the mark field points to is
non-NULL for a match, non-match, or partial match, pcretest
prints the string to which it points. For a match, this is shown
on a line by itself, tagged with "MK:". For a non-match it is
added to the message.
The /L modifier must be followed directly by the name of a
locale, for example,
  /pattern/Lfr_FR
For this reason, it must be the last modifier. The given locale
is set, pcre[16|32]_maketables() is called to build a set of
character tables for the locale, and this is then passed to
pcre[16|32]_compile() when compiling the regular expression.
Without an /L (or /T) modifier, NULL is passed as the tables
pointer; that is, /L applies only to the expression on which it
appears.
The /M modifier causes the size in bytes of the memory block used
to hold the compiled pattern to be output. This does not include
the size of the pcre[16|32] block; it is just the actual compiled
data. If the pattern is successfully studied with the
PCRE_STUDY_JIT_COMPILE option, the size of the JIT compiled code
is also output.
The /Q modifier is used to test the use of pcre_stack_guard. It
must be followed by '0' or '1', specifying the return code to be
given from an external function that is passed to PCRE and used
for stack checking during compilation (see the pcreapi
documentation for details).
The /S modifier causes pcre[16|32]_study() to be called after the
expression has been compiled, and the results used when the
expression is matched. There are a number of qualifying
characters that may follow /S. They may appear in any order.
If /S is followed by an exclamation mark, pcre[16|32]_study() is
called with the PCRE_STUDY_EXTRA_NEEDED option, causing it always
to return a pcre_extra block, even when studying discovers no
useful information.
If /S is followed by a second S character, it suppresses
studying, even if it was requested externally by the -s command
line option. This makes it possible to specify that certain
patterns are always studied, and others are never studied,
independently of -s. This feature is used in the test files in a
few cases where the output is different when the pattern is
studied.
If the /S modifier is followed by a + character, the call to
pcre[16|32]_study() is made with all the JIT study options,
requesting just-in-time optimization support if it is available,
for both normal and partial matching. If you want to restrict the
JIT compiling modes, you can follow /S+ with a digit in the range
1 to 7:
  1  normal match only
  2  soft partial match only
  3  normal match and soft partial match
  4  hard partial match only
  6  soft and hard partial match
  7  all three modes (default)
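
The listed digits are consistent with a three-bit mask (1 for normal, 2 for soft partial, 4 for hard partial). The sketch below decodes a digit under that assumption; the bitmask reading and the function name are inferences from the table, not something the text states. The digit 5 is not listed above, so the result the sketch gives for it (normal and hard partial) is inferred as well.

```python
# Assumed bit values, inferred from the rows of the table above.
JIT_MODE_BITS = [(1, "normal"), (2, "soft partial"), (4, "hard partial")]


def jit_modes(digit: int = 7) -> list:
    """Decode the digit that may follow /S+ into a list of JIT modes."""
    if not 1 <= digit <= 7:
        raise ValueError("digit must be in the range 1 to 7")
    # Keep each mode whose bit is set in the digit.
    return [name for bit, name in JIT_MODE_BITS if digit & bit]
```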
If /S++ is used instead of /S+ (with or without a following
digit), the text "(JIT)" is added to the first output line after
a match or no match when JIT-compiled code was actually used.
Note that there is also an independent /+ modifier; it must not
be given immediately after /S or /S+ because this will be
misinterpreted.
If JIT studying is successful, the compiled JIT code will
automatically be used when pcre[16|32]_exec() is run, except when
incompatible run-time options are specified. For more details,
see the pcrejit documentation. See also the \J escape sequence
below for a way of setting the size of the JIT stack.
Finally, if /S is followed by a minus character, JIT compilation
is suppressed, even if it was requested externally by the -s
command line option. This makes it possible to specify that JIT
is never to be used for certain patterns.
The /T modifier must be followed by a single digit. It causes a
specific set of built-in character tables to be passed to
pcre[16|32]_compile(). It is used in the standard PCRE tests to
check behaviour with different character tables. The digit
specifies the tables as follows:
  0  the default ASCII tables, as distributed in
       pcre_chartables.c.dist
  1  a set of tables defining ISO 8859 characters
In table 1, some characters whose codes are greater than 128 are
identified as letters, digits, spaces, etc.
Using the POSIX wrapper API
The /P modifier causes pcretest to call PCRE via the POSIX
wrapper API rather than its native API. This supports only the
8-bit library. When /P is set, the following modifiers set
options for the regcomp() function:
  /i    REG_ICASE
  /m    REG_NEWLINE
  /N    REG_NOSUB
  /s    REG_DOTALL     )
  /U    REG_UNGREEDY   ) These options are not part of
  /W    REG_UCP        )   the POSIX standard
  /8    REG_UTF8       )
The /+ modifier works as described above. All other modifiers are
ignored.
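
The mapping above can be expressed as a lookup table. The sketch below takes a string of single-character modifiers and returns the names of the regcomp() options they would select under /P. The function name is hypothetical, and the sketch deals only in option names, not real flag values.

```python
# Modifier letters and the regcomp() option each one sets when /P is present.
POSIX_FLAGS = {
    "i": "REG_ICASE",
    "m": "REG_NEWLINE",
    "N": "REG_NOSUB",
    "s": "REG_DOTALL",
    "U": "REG_UNGREEDY",
    "W": "REG_UCP",
    "8": "REG_UTF8",
}


def regcomp_option_names(modifiers: str) -> list:
    """Return the regcomp() option names selected by the given modifiers."""
    # Modifiers with no entry in the table contribute no regcomp() option.
    return [POSIX_FLAGS[ch] for ch in modifiers if ch in POSIX_FLAGS]
```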
Locking out certain modifiers
PCRE can be compiled with or without support for certain features
such as UTF-8/16/32 or Unicode properties. Accordingly, the
standard tests are split up into a number of different files that
are selected for running depending on which features are
available. When updating the tests, it is all too easy to put a
new test into the wrong file by mistake; for example, to put a
test that requires UTF support into a file that is used when it
is not available. To help detect such mistakes as early as
possible, there is a facility for locking out specific modifiers.
If an input line for pcretest starts with the string "< forbid ",
the following sequence of characters is taken as a list of
forbidden modifiers. For example, in the test files that must not
use UTF or Unicode property support, this line appears:
< forbid 8W
This locks out the /8 and /W modifiers. An immediate error is
given if they are subsequently encountered. If the character
string contains < but not >, all the multi-character modifiers
that begin with < are locked out. Otherwise, such modifiers must
be explicitly listed, for example:
< forbid <JS><cr>
There must be a single space between < and "forbid" for this
feature to be recognised. If there is not, the line is
interpreted either as a request to re-load a pre-compiled pattern
(see "SAVING AND RELOADING COMPILED PATTERNS" below) or, if there
is another < character, as a pattern that uses < as its
delimiter.
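
The rules for a "< forbid " line can be sketched as a small parser. This is an illustration of the description above, not pcretest's implementation. The function name and return shape are hypothetical, and lower-casing the angle-bracket names for comparison is an assumption.

```python
import re

FORBID_PREFIX = "< forbid "  # exactly one space between "<" and "forbid"


def parse_forbid(line: str):
    """Parse a forbid line into (single_chars, angle_names, all_angle_locked).

    Returns None when the line does not start with the forbid prefix.
    """
    if not line.startswith(FORBID_PREFIX):
        return None
    spec = "".join(line[len(FORBID_PREFIX):].split())
    if "<" in spec and ">" not in spec:
        # A "<" with no ">" locks out every modifier that begins with "<".
        return set(spec.replace("<", "")), set(), True
    # Otherwise angle-bracket modifiers must be listed explicitly.
    angle = {name.lower() for name in re.findall(r"<[^<>]*>", spec)}
    singles = set(re.sub(r"<[^<>]*>", "", spec))
    return singles, angle, False
```

A line such as "<forbid 8W", without the space, is not recognised by this sketch and returns None, mirroring the rule that such a line is interpreted in some other way.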