Путеводитель по Руководству Linux

  User  |  Syst  |  Libr  |  Device  |  Files  |  Other  |  Admin  |  Head  |



   pcrepattern    ( 3 )

Perl-совместимые регулярные выражения (Perl-compatible regular expressions)

  Name  |  Pcre regular expression details  |    Special start-of-pattern items    |  Ebcdic character codes  |  Characters and metacharacters  |  Backslash  |  Circumflex and dollar  |  Full stop (period, dot) and \n  |  Matching a single data unit  |  Square brackets and character classes  |  Posix character classes  |  Compatibility feature for word boundaries  |  Vertical bar  |  Internal option setting  |  Subpatterns  |  Duplicate subpattern numbers  |  Named subpatterns  |  Repetition  |  Atomic grouping and possessive quantifiers  |  Back references  |  Assertions  |  Conditional subpatterns  |  Comments  |  Recursive patterns  |  Subpatterns as subroutines  |  Oniguruma subroutine syntax  |  Callouts  |  Backtracking control  |  See also  |

SPECIAL START-OF-PATTERN ITEMS

A number of options that can be passed to pcre_compile() can also
       be set by special items at the start of a pattern. These are not
       Perl-compatible, but are provided to make these options
       accessible to pattern writers who are not able to change the
       program that processes the pattern. Any number of these items may
       appear, but they must all be together right at the start of the
       pattern string, and the letters must be in upper case.

UTF support

The original operation of PCRE was on strings of one-byte characters. However, there is now also support for UTF-8 strings in the original library, an extra library that supports 16-bit and UTF-16 character strings, and a third library that supports 32-bit and UTF-32 character strings. To use these features, PCRE must be built to include appropriate support. When using UTF strings you must either call the compiling function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of these special sequences:

(*UTF8) (*UTF16) (*UTF32) (*UTF)

(*UTF) is a generic sequence that can be used with any of the libraries. Starting a pattern with such a sequence is equivalent to setting the relevant option. How setting a UTF mode affects pattern matching is mentioned in several places below. There is also a summary of features in the pcreunicode page.

Some applications that allow their users to supply patterns may wish to restrict them to non-UTF data for security reasons. If the PCRE_NEVER_UTF option is set at compile time, (*UTF) etc. are not allowed, and their appearance causes an error.

Unicode property support

Another special sequence that may appear at the start of a pattern is (*UCP). This has the same effect as setting the PCRE_UCP option: it causes sequences such as \d and \w to use Unicode properties to determine character types, instead of recognizing only characters with codes less than 128 via a lookup table.

Disabling auto-possessification

If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making quantifiers possessive when what follows cannot match the repeated item. For example, by default a+b is treated as a++b. For more details, see the pcreapi documentation.

Disabling start-up optimizations

If a pattern starts with (*NO_START_OPT), it has the same effect as setting the PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables several optimizations for quickly reaching "no match" results. For more details, see the pcreapi documentation.

Newline conventions

PCRE supports five different conventions for indicating line breaks in strings: a single CR (carriage return) character, a single LF (linefeed) character, the two-character sequence CRLF, any of the three preceding, or any Unicode newline sequence. The pcreapi page has further discussion about newlines, and shows how to set the newline convention in the options arguments for the compiling and matching functions.

It is also possible to specify a newline convention by starting a pattern string with one of the following five sequences:

(*CR) carriage return (*LF) linefeed (*CRLF) carriage return, followed by linefeed (*ANYCRLF) any of the three above (*ANY) all Unicode newline sequences

These override the default and the options given to the compiling function. For example, on a Unix system where LF is the default newline sequence, the pattern

(*CR)a.b

changes the convention to CR. That pattern matches "a\nb" because LF is no longer a newline. If more than one of these settings is present, the last one is used.

The newline convention affects where the circumflex and dollar assertions are true. It also affects the interpretation of the dot metacharacter when PCRE_DOTALL is not set, and the behaviour of \N. However, it does not affect what the \R escape sequence matches. By default, this is any Unicode newline sequence, for Perl compatibility. However, this can be changed; see the description of \R in the section entitled "Newline sequences" below. A change of \R setting can be combined with a change of newline convention.

Setting match and recursion limits

The caller of pcre_exec() can set a limit on the number of times the internal match() function is called and on the maximum depth of recursive calls. These facilities are provided to catch runaway matches that are provoked by patterns with huge matching trees (a typical example is a pattern with nested unlimited repeats) and to avoid running out of system stack by too much recursion. When one of these limits is reached, pcre_exec() gives an error return. The limits can also be set by items at the start of the pattern of the form

(*LIMIT_MATCH=d) (*LIMIT_RECURSION=d)

where d is any number of decimal digits. However, the value of the setting must be less than the value set (or defaulted) by the caller of pcre_exec() for it to have any effect. In other words, the pattern writer can lower the limits set by the programmer, but not raise them. If there is more than one setting of one of these limits, the lower value is used.