Perl-compatible regular expressions
MATCHING A SINGLE DATA UNIT
Outside a character class, the escape sequence \C matches any one
data unit, whether or not a UTF mode is set. In the 8-bit
library, one data unit is one byte; in the 16-bit library it is a
16-bit unit; in the 32-bit library it is a 32-bit unit. Unlike a
dot, \C always matches line-ending characters. Perl provides this
feature to match individual bytes in UTF-8 mode, but it is unclear
what practical use it has. Because \C
breaks up characters into individual data units, matching one
unit with \C in a UTF mode means that the rest of the string may
start with a malformed UTF character. This has undefined results,
because PCRE assumes that it is dealing with valid UTF strings
(and by default it checks this at the start of processing unless
the PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or
PCRE_NO_UTF32_CHECK option is used).
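The following is a minimal sketch, not PCRE code, of the two points made
above: how many data units one character occupies in each library width,
and why consuming a single unit can leave a malformed remainder. Python's
re module has no \C, so the sketch works directly on encoded bytes. The
function names data_units, remainder_after_one_unit and is_valid_utf8 are
hypothetical helpers introduced only for this illustration.

```python
def data_units(text: str) -> dict:
    # Number of data units the text occupies in each library width.
    return {
        8: len(text.encode("utf-8")),            # one unit = one byte
        16: len(text.encode("utf-16-le")) // 2,  # one unit = 16 bits
        32: len(text.encode("utf-32-le")) // 4,  # one unit = 32 bits
    }


def remainder_after_one_unit(text: str) -> bytes:
    # UTF-8 bytes left over after a single byte has been consumed,
    # which is what a lone \C would do in the 8-bit library.
    return text.encode("utf-8")[1:]


def is_valid_utf8(data: bytes) -> bool:
    # True if the bytes decode cleanly as UTF-8.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(data_units("\u00e9"))       # U+00E9, two bytes in UTF-8
    print(data_units("\U0001F600"))   # U+1F600, outside the BMP
    print(is_valid_utf8(remainder_after_one_unit("\u00e9a")))
```

For a two-byte character such as U+00E9, taking one byte leaves a string
that begins with a continuation byte, which is exactly the malformed
input that PCRE assumes it will never see.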
PCRE does not allow \C to appear in lookbehind assertions
(described below) in a UTF mode, because this would make it
impossible to calculate the length of the lookbehind.
In general, the \C escape sequence is best avoided. However, one
way of using it that avoids the problem of malformed UTF
characters is to use a lookahead to check the length of the next
character, as in this pattern, which could be used with a UTF-8
string (ignore white space and line breaks):
    (?| (?=[\x00-\x7f])(\C) |
        (?=[\x80-\x{7ff}])(\C)(\C) |
        (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
        (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
A group that starts with (?| resets the capturing parentheses
numbers in each alternative (see "Duplicate Subpattern Numbers"
below). The assertions at the start of each branch check the next
UTF-8 character for values whose encoding uses 1, 2, 3, or 4
bytes, respectively. The character's individual bytes are then
captured by the appropriate number of groups.
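The branch selection in the pattern above can be mirrored outside PCRE as
a rough sketch. The Python below is an analogue, not an invocation of the
pattern: it picks the byte count from the same code point ranges that the
lookaheads test, then returns the individual bytes as the capture groups
would. The name utf8_units is hypothetical, and the sketch assumes a
single valid Unicode scalar value as input.

```python
def utf8_units(ch: str) -> tuple:
    # Split one character into its UTF-8 bytes, choosing the count
    # from the same ranges as the four lookahead assertions.
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    cp = ord(ch)
    if cp <= 0x7F:
        n = 1      # branch 1: [\x00-\x7f]
    elif cp <= 0x7FF:
        n = 2      # branch 2: [\x80-\x{7ff}]
    elif cp <= 0xFFFF:
        n = 3      # branch 3: [\x{800}-\x{ffff}]
    else:
        n = 4      # branch 4: [\x{10000}-\x{1fffff}]
    encoded = ch.encode("utf-8")
    if len(encoded) != n:
        raise AssertionError("range table disagrees with the encoder")
    # One single-byte item per capture group.
    return tuple(encoded[i:i + 1] for i in range(n))


if __name__ == "__main__":
    for sample in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(hex(ord(sample)), utf8_units(sample))
```

Because the byte count is fixed before any byte is taken, the bytes are
always consumed as a whole character, which is the property that keeps
the rest of the subject string well formed.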