Perl-compatible regular expressions
SQUARE BRACKETS AND CHARACTER CLASSES
An opening square bracket introduces a character class,
terminated by a closing square bracket. A closing square bracket
on its own is not special by default. However, if the
PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square
bracket causes a compile-time error. If a closing square bracket
is required as a member of the class, it should be the first data
character in the class (after an initial circumflex, if present)
or escaped with a backslash.
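The placement rule for a closing square bracket can be tried out in a short sketch. It uses Python's built-in re module rather than PCRE itself, on the assumption that re handles these particular patterns the same way. The helper name class_members is made up for this illustration.

```python
import re

def class_members(pattern, candidates):
    """Return the characters of `candidates` matched by a one-character pattern."""
    compiled = re.compile(pattern)
    return "".join(ch for ch in candidates if compiled.fullmatch(ch))

# "]" placed first in the class is a literal member.
first = class_members(r"[]a]", "a]b")
# After an initial circumflex it still counts as the first data character.
negated = class_members(r"[^]a]", "a]b")
# Escaped with a backslash, it can appear in any position.
escaped = class_members(r"[a\]]", "a]b")
# A closing bracket outside a class is an ordinary character by default.
lone = re.fullmatch(r"]", "]") is not None
```

The PCRE_JAVASCRIPT_COMPAT behaviour has no counterpart in re, so it is not exercised here.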
A character class matches a single character in the subject. In a
UTF mode, the character may be more than one data unit long. A
matched character must be in the set of characters defined by the
class, unless the first character in the class definition is a
circumflex, in which case the subject character must not be in
the set defined by the class. If a circumflex is actually
required as a member of the class, ensure it is not the first
character, or escape it with a backslash.
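As a rough illustration of the circumflex rule, the following sketch uses Python's re module as a stand-in for PCRE (an assumption; the two agree on these simple classes). The subject string "a^b" is arbitrary.

```python
import re

# A circumflex that is not first is an ordinary class member.
not_first = re.findall(r"[a^]", "a^b")
# An escaped circumflex is a member even in first position.
escaped = re.findall(r"[\^a]", "a^b")
# An unescaped leading circumflex negates the class instead.
negating = re.findall(r"[^a]", "a^b")
```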
For example, the character class [aeiou] matches any lower case
vowel, while [^aeiou] matches any character that is not a lower
case vowel. Note that a circumflex is just a convenient notation
for specifying the characters that are in the class by
enumerating those that are not. A class that starts with a
circumflex is not an assertion; it still consumes a character
from the subject string, and therefore it fails if the current
pointer is at the end of the string.
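The point that a negated class still consumes a character can be seen in a minimal sketch. Python's re module is assumed here as a substitute for PCRE, and the subject strings are invented for the example.

```python
import re

# Positive and negated vowel classes.
vowels = re.findall(r"[aeiou]", "regular Expression")
non_vowels = re.findall(r"[^aeiou]", "PCRE ok")

# "x[^y]" needs one more character after "x". At the end of the
# subject there is nothing left to consume, so the match fails.
at_end = re.search(r"x[^y]", "x")
with_more = re.search(r"x[^y]", "xz")
```

Note that the upper case "E" in the first subject is not matched, because matching is caseful by default.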
In UTF-8 mode, characters with values greater than 255 (in UTF-16
or UTF-32 mode, greater than 0xffff) can be included in a class as
a literal string of data units, or by using the \x{ escaping
mechanism.
When caseless matching is set, any letters in a class represent
both their upper case and lower case versions, so for example, a
caseless [aeiou] matches "A" as well as "a", and a caseless
[^aeiou] does not match "A", whereas a caseful version would. In
a UTF mode, PCRE always understands the concept of case for
characters whose values are less than 128, so caseless matching
is always possible. For characters with higher values, the
concept of case is supported if PCRE is compiled with Unicode
property support, but not otherwise. If you want to use caseless
matching in a UTF mode for characters 128 and above, you must
ensure that PCRE is compiled with Unicode property support as
well as with UTF support.
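The caseless examples above can be checked with a small sketch. It assumes Python's re module and its IGNORECASE flag as an analogue of PCRE's caseless option, and it covers only ASCII letters.

```python
import re

# Caseless: "A" is treated like "a", so it is in [aeiou] ...
caseless_hit = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)
# ... and therefore excluded by the negated class.
caseless_negated = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)
# Caseful: "A" is not a lower case vowel, so the negated class matches it.
caseful_negated = re.fullmatch(r"[^aeiou]", "A")
```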
Characters that might indicate line breaks are never treated in
any special way when matching character classes, whatever line-
ending sequence is in use, and whatever setting of the
PCRE_DOTALL and PCRE_MULTILINE options is used. A class such as
[^a] always matches one of these characters.
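A brief sketch of the line-break behaviour, assuming Python's re module behaves like PCRE for this case: a negated class accepts newline characters even though the dot metacharacter does not without a DOTALL-style option.

```python
import re

# Line-break characters are ordinary members of a negated class.
newline_in_class = re.fullmatch(r"[^a]", "\n")
cr_in_class = re.fullmatch(r"[^a]", "\r")
# By contrast, "." does not match a newline unless DOTALL is set.
dot_on_newline = re.fullmatch(r".", "\n")
```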
The minus (hyphen) character can be used to specify a range of
characters in a character class. For example, [d-m] matches any
letter between d and m, inclusive. If a minus character is
required in a class, it must be escaped with a backslash or
appear in a position where it cannot be interpreted as indicating
a range, typically as the first or last character in the class,
or immediately after a range. For example, [b-d-z] matches
letters in the range b to d, a hyphen character, or z.
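The hyphen positions described above can be illustrated as follows. This is a sketch using Python's re module in place of PCRE (assumed equivalent for these classes). The helper class_members is hypothetical.

```python
import re

def class_members(pattern, candidates):
    """Return the characters of `candidates` matched by a one-character pattern."""
    compiled = re.compile(pattern)
    return "".join(ch for ch in candidates if compiled.fullmatch(ch))

# A plain range, inclusive at both ends.
range_only = class_members(r"[d-m]", "acdhmnz-")
# Range b-d, then a literal hyphen (it follows a range), then z.
mixed = class_members(r"[b-d-z]", "abcdeyz-")
# A hyphen in first position cannot start a range, so it is literal.
leading = class_members(r"[-az]", "a-mz")
# An escaped hyphen is literal anywhere.
escaped = class_members(r"[a\-z]", "a-mz")
```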
It is not possible to have the literal character "]" as the end
character of a range. A pattern such as [W-]46] is interpreted as
a class of two characters ("W" and "-") followed by a literal
string "46]", so it would match "W46]" or "-46]". However, if the
"]" is escaped with a backslash, it is interpreted as the end of
the range, so [W-\]46] is interpreted as a class containing a range
followed by two other characters. The octal or hexadecimal
representation of "]" can also be used to end a range.
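The two readings of these patterns can be compared in a short sketch. Python's re module is assumed to parse them as PCRE does, and the candidate string is chosen only for illustration.

```python
import re

# Unescaped: the class is just "W" and "-", followed by the literal "46]".
two_member = re.compile(r"[W-]46]")
w_form = two_member.fullmatch("W46]")
hyphen_form = two_member.fullmatch("-46]")
x_form = two_member.fullmatch("X46]")

# Escaped: the range W..] plus the characters "4" and "6".
range_class = re.compile(r"[W-\]46]")
candidates = "VWXYZ[\\]^46-"
members = "".join(ch for ch in candidates if range_class.fullmatch(ch))

# The octal (\135) and hexadecimal (\x5d) codes for "]" give the same class.
octal_members = "".join(
    ch for ch in candidates if re.fullmatch(r"[W-\13546]", ch))
hex_members = "".join(
    ch for ch in candidates if re.fullmatch(r"[W-\x5d46]", ch))
```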
An error is generated if a POSIX character class (see below) or
an escape sequence other than one that defines a single character
appears at a point where a range ending character is expected.
For example, [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are
not.
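A minimal sketch of the range-end restriction, assuming Python's re module, which rejects a character type as a range end much as PCRE does. The helper compiles is hypothetical. Python's re does not implement POSIX class names, so the [A-[:digit:]] case is not exercised here.

```python
import re

def compiles(pattern):
    """Report whether the pattern is accepted by the regex compiler."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# A single-character escape is a valid range end.
valid = compiles(r"[z-\xff]")
# A character type such as \d is not.
invalid = compiles(r"[A-\d]")
```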
Ranges operate in the collating sequence of character values.
They can also be used for characters specified numerically, for
example [\000-\037]. Ranges can include any characters that are
valid for the current mode.
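Numerically specified range ends can be sketched like this, again with Python's re module standing in for PCRE (an assumption; both accept three-digit octal escapes inside a class).

```python
import re

# Octal 000 through 037 covers the ASCII control characters 0x00-0x1f.
control = re.compile(r"[\000-\037]")

tab = control.fullmatch("\t")          # 0x09, inside the range
unit_sep = control.fullmatch("\x1f")   # 0x1f, the upper bound
space = control.fullmatch(" ")         # 0x20 (octal 040), just outside
```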
If a range that includes letters is used when caseless matching
is set, it matches the letters in either case. For example, [W-c]
is equivalent to [][\\^_`wxyzabc], matched caselessly, and in a
non-UTF mode, if character tables for a French locale are in use,
[\xc8-\xcb] matches accented E characters in both cases. In UTF
modes, PCRE supports the concept of case for characters with
values of 128 or greater only when it is compiled with Unicode
property support.
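The stated equivalence for [W-c] can be checked over the ASCII range with a small sketch. It assumes Python's re module with IGNORECASE as a stand-in for caseless PCRE matching. The locale-dependent French example is not covered.

```python
import re

ascii_chars = [chr(code) for code in range(128)]

by_range = re.compile(r"[W-c]", re.IGNORECASE)
by_listing = re.compile(r"[][\\^_`wxyzabc]", re.IGNORECASE)

# Collect every ASCII character each class accepts.
range_set = {ch for ch in ascii_chars if by_range.fullmatch(ch)}
listed_set = {ch for ch in ascii_chars if by_listing.fullmatch(ch)}
accepted = "".join(sorted(range_set))
```

The range runs through the punctuation that sits between "Z" and "a" in ASCII, which is why those characters appear in the explicit listing.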
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S,
\v, \V, \w, and \W may appear in a character class, and add the
characters that they match to the class. For example, [\dABCDEF]
matches any hexadecimal digit written with upper case letters. In
UTF modes, the PCRE_UCP option
affects the meanings of \d, \s, \w and their upper case partners,
just as it does when they appear outside a character class, as
described in the section entitled "Generic character types"
above. The escape sequence \b has a different meaning inside a
character class; it matches the backspace character. The
sequences \B, \N, \R, and \X are not special inside a character
class. Like any other unrecognized escape sequences, they are
treated as the literal characters "B", "N", "R", and "X" by
default, but cause an error if the PCRE_EXTRA option is set.
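The behaviour of \d and \b inside a class can be sketched as follows, with Python's re module assumed as an approximation of PCRE. The handling of \B, \N, \R, and \X inside a class differs in re, so those sequences are not exercised here.

```python
import re

# \d adds the digits to the explicitly listed letters.
hex_upper = re.compile(r"[\dABCDEF]")
hits = "".join(ch for ch in "09AFGaf" if hex_upper.fullmatch(ch))

# Inside a class, \b is the backspace character (0x08) ...
backspace = re.fullmatch(r"[\b]", "\x08")
# ... whereas outside a class it is a word boundary assertion.
boundary = re.search(r"\bcat", "a cat")
```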
A circumflex can conveniently be used with the upper case
character types to specify a more restricted set of characters
than the matching lower case type. For example, the class [^\W_]
matches any letter or digit, but not underscore, whereas [\w]
includes underscore. A positive character class should be read as
"something OR something OR ..." and a negative class as "NOT
something AND NOT something AND NOT ...".
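The [^\W_] idiom reads as "NOT a non-word character AND NOT underscore". A quick sketch, assuming Python's re module in place of PCRE and an arbitrary set of candidate characters:

```python
import re

alnum_only = re.compile(r"[^\W_]")
word = re.compile(r"[\w]")

candidates = "aZ5_-"
# Letters and digits survive; underscore and hyphen are excluded.
kept = "".join(ch for ch in candidates if alnum_only.fullmatch(ch))
# \w on its own also admits the underscore.
word_kept = "".join(ch for ch in candidates if word.fullmatch(ch))
```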
The only metacharacters that are recognized in character classes
are backslash, hyphen (only where it can be interpreted as
specifying a range), circumflex (only at the start), opening
square bracket (only when it can be interpreted as introducing a
POSIX class name, or for a special compatibility feature - see
the next two sections), and the terminating closing square
bracket. However, escaping other non-alphanumeric characters does
no harm.
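To illustrate that characters which are special outside a class lose that meaning inside one, and that escaping them anyway changes nothing, here is a sketch using Python's re module (assumed to agree with PCRE on this point). The subject string is invented.

```python
import re

subject = "a.b*c+d?e$f(g|h)"

# Unescaped: each of these is an ordinary member inside a class.
plain = re.findall(r"[.*+?$(|)]", subject)
# Escaped: the backslashes are redundant but harmless.
escaped = re.findall(r"[\.\*\+\?\$\(\|\)]", subject)
```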