pcrepattern(3)

Perl-compatible regular expressions

MATCHING A SINGLE DATA UNIT

Outside a character class, the escape sequence \C matches any one data unit, whether or not a UTF mode is set. In the 8-bit library, one data unit is one byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches line-ending characters. The feature is provided in Perl in order to match individual bytes in UTF-8 mode, but it is unclear how it can usefully be used. Because \C breaks up characters into individual data units, matching one unit with \C in a UTF mode means that the rest of the string may start with a malformed UTF character. This has undefined results, because PCRE assumes that it is dealing with valid UTF strings (and by default it checks this at the start of processing unless the PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or PCRE_NO_UTF32_CHECK option is used).
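
The data-unit arithmetic and the malformed-remainder problem can be illustrated without PCRE itself. The following Python sketch uses only the standard codecs and does not call PCRE; the helper names data_units and is_valid_utf8 are hypothetical. It counts the data units one character occupies in each encoding, then emulates a single \C in UTF-8 mode by skipping exactly one byte of the subject:

```python
def data_units(text, width):
    """Number of data units `text` occupies in the 8-, 16- or 32-bit encoding."""
    codec = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}[width]
    return len(text.encode(codec)) // (width // 8)


def is_valid_utf8(data):
    """True if the byte string decodes as well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


euro = "\u20ac"  # EURO SIGN: three bytes in UTF-8
print(data_units(euro, 8), data_units(euro, 16), data_units(euro, 32))
# prints "3 1 1"

# Emulate one \C in UTF-8 mode: consume exactly one byte of the subject.
subject = (euro + "x").encode("utf-8")
rest = subject[1:]  # now starts with a continuation byte
print(is_valid_utf8(subject), is_valid_utf8(rest))
# prints "True False"
```

The remainder begins in the middle of a character, which is the situation the paragraph above describes as having undefined results.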

PCRE does not allow \C to appear in lookbehind assertions (described below) in a UTF mode, because this would make it impossible to calculate the length of the lookbehind.

In general, the \C escape sequence is best avoided. However, one way of using it that avoids the problem of malformed UTF characters is to use a lookahead to check the length of the next character, as in this pattern, which could be used with a UTF-8 string (ignore white space and line breaks):

         (?| (?=[\x00-\x7f])(\C) |
             (?=[\x80-\x{7ff}])(\C)(\C) |
             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

A group that starts with (?| resets the capturing parentheses numbers in each alternative (see "Duplicate Subpattern Numbers" below). The assertions at the start of each branch check the next UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The character's individual bytes are then captured by the appropriate number of groups.
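
The branch selection can be sketched outside PCRE as well. Python's re module has no \C escape, so the code below is only an emulation of the pattern's logic, not a use of it: the table BRANCHES and the function capture_bytes are hypothetical names, and the ranges are copied from the four lookahead assertions above.

```python
# Code point ranges from the four lookahead assertions, each paired with
# the number of \C items (bytes) in the corresponding branch.
BRANCHES = [
    (0x00, 0x7F, 1),
    (0x80, 0x7FF, 2),
    (0x800, 0xFFFF, 3),
    (0x10000, 0x1FFFFF, 4),
]


def capture_bytes(char):
    """Return the UTF-8 bytes of one character as a tuple of 1-byte strings,
    mimicking capture groups 1..n of the branch whose range matches."""
    code_point = ord(char)
    for low, high, count in BRANCHES:
        if low <= code_point <= high:
            encoded = char.encode("utf-8")
            # The range check plays the role of the lookahead: it predicts
            # how many bytes the encoded character must have.
            assert len(encoded) == count
            return tuple(encoded[i:i + 1] for i in range(count))
    raise ValueError("code point outside the ranges covered by the pattern")


print(capture_bytes("A"))       # (b'A',)
print(capture_bytes("\u00e9"))  # (b'\xc3', b'\xa9')
print(capture_bytes("\u20ac"))  # (b'\xe2', b'\x82', b'\xac')
```

Because every alternative consumes a whole character, the position after the match always falls on a character boundary, which is why this use of \C avoids the malformed-remainder problem.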