`pcrepattern` ( 3 )

Perl-совместимые регулярные выражения (Perl-compatible regular expressions)

POSIX CHARACTER CLASSES

Perl supports the POSIX notation for character classes. This uses names enclosed by [: and :] within the enclosing square brackets. PCRE also supports this notation. For example,

[01[:alpha:]%]

matches "0", "1", any alphabetic character, or "%". The supported class names are:

alnum letters and digits alpha letters ascii character codes 0 - 127 blank space or tab only cntrl control characters digit decimal digits (same as \d) graph printing characters, excluding space lower lower case letters print printing characters, including space punct printing characters, excluding letters and digits and space space white space (the same as \s from PCRE 8.34) upper upper case letters word "word" characters (same as \w) xdigit hexadecimal digits

The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and space (32). If locale-specific matching is taking place, the list of space characters may be different; there may be fewer or more of them. "Space" used to be different to \s, which did not include VT, for Perl compatibility. However, Perl changed at release 5.18, and PCRE followed at release 8.34. "Space" and \s now match the same set of characters.

The name "word" is a Perl extension, and "blank" is a GNU extension from Perl 5.8. Another Perl extension is negation, which is indicated by a ^ character after the colon. For example,

[12[:^digit:]]

matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not supported, and an error is given if they are encountered.

By default, characters with values greater than 128 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing certain POSIX classes by other sequences, as follows:

[:alnum:] becomes \p{Xan} [:alpha:] becomes \p{L} [:blank:] becomes \h [:digit:] becomes \p{Nd} [:lower:] becomes \p{Ll} [:space:] becomes \p{Xps} [:upper:] becomes \p{Lu} [:word:] becomes \p{Xwd}

Negated versions, such as [:^alpha:] use \P instead of \p. Three other POSIX classes are handled specially in UCP mode:

[:graph:] This matches characters that have glyphs that mark the page when printed. In Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf properties, except for:

U+061C Arabic Letter Mark U+180E Mongolian Vowel Separator U+2066 - U+2069 Various "isolate"s

[:print:] This matches the same characters as [:graph:] plus space characters that are not controls, that is, characters with the Zs property.

[:punct:] This matches all characters that have the Unicode P (punctuation) property, plus those characters whose code points are less than 128 that have the S (Symbol) property.

The other POSIX classes are unchanged, and match only characters with code points less than 128.

Исходный текст на man7.org

pcrepattern ( 3 )

POSIX CHARACTER CLASSES

`pcrepattern` ( 3 )