Perl supports the POSIX notation for character classes. This uses
names enclosed by [: and :] within the enclosing square brackets.
PCRE also supports this notation. For example,

  [01[:alpha:]%]

matches "0", "1", any alphabetic character, or "%". The supported
class names are:

  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters, digits, and space
  space    white space (the same as \s from PCRE 8.34)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits

The default "space" characters are HT (9), LF (10), VT (11), FF
(12), CR (13), and space (32). If locale-specific matching is
taking place, the list of space characters may be different;
there may be fewer or more of them. "Space" used to be different
to \s, which did not include VT, for Perl compatibility.
However, Perl changed at release 5.18, and PCRE followed at
release 8.34. "Space" and \s now match the same set of
characters.
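
The default set listed above can be cross-checked against the C library. The following is a minimal sketch, assuming the program runs in the default "C" locale and that the default "space" set corresponds to what isspace() accepts there; the helper name default_space_codes is hypothetical and is not part of PCRE.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Collect the code points below 128 that isspace() accepts.
   Assumption: setlocale() has not been called, so the "C" locale
   is in effect. default_space_codes is a hypothetical helper. */
static size_t default_space_codes(int *out, size_t cap)
{
    size_t n = 0;
    for (int c = 0; c < 128; c++) {
        if (isspace(c)) {
            assert(n < cap);   /* caller's buffer must be large enough */
            out[n++] = c;
        }
    }
    return n;
}
```

Called with a buffer of at least six entries, this should yield HT, LF, VT, FF, CR, and space, in ascending order. Under a different locale the result may differ, as described above.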
The name "word" is a Perl extension, and "blank" is a GNU
extension from Perl 5.8. Another Perl extension is negation,
which is indicated by a ^ character after the colon. For example,

  [12[:^digit:]]

matches "1", "2", or any non-digit. PCRE (and Perl) also
recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
"collating element", but these are not supported, and an error is
given if they are encountered.
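
To make the semantics of the two example classes concrete, the sketch below emulates them for single bytes with the standard ctype functions. It is an illustration only, not how PCRE implements classes; it assumes the "C" locale and no UCP processing, and the function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Emulates [01[:alpha:]%]: "0", "1", "%", or an alphabetic character. */
static bool in_class_01_alpha_pct(int c)
{
    assert(c >= 0 && c <= 255);  /* ctype functions need unsigned char range */
    return c == '0' || c == '1' || c == '%' || (c < 128 && isalpha(c));
}

/* Emulates [12[:^digit:]]: "1", "2", or anything that is not a digit. */
static bool in_class_12_not_digit(int c)
{
    assert(c >= 0 && c <= 255);
    return c == '1' || c == '2' || !(c < 128 && isdigit(c));
}
```

Note that in the second class the negation applies only to the POSIX class itself: "1" and "2" still match because they are listed explicitly, while every other digit is rejected.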
By default, characters with values greater than 127 do not match
any of the POSIX character classes. However, if the PCRE_UCP
option is passed to pcre_compile(), some of the classes are
changed so that Unicode character properties are used. This is
achieved by replacing certain POSIX classes by other sequences,
as follows:

  [:alnum:]  becomes  \p{Xan}
  [:alpha:]  becomes  \p{L}
  [:blank:]  becomes  \h
  [:digit:]  becomes  \p{Nd}
  [:lower:]  becomes  \p{Ll}
  [:space:]  becomes  \p{Xps}
  [:upper:]  becomes  \p{Lu}
  [:word:]   becomes  \p{Xwd}

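
The substitution table can be pictured as a simple lookup. The following is a sketch under stated assumptions, not PCRE's actual code: the struct, table, and function names are hypothetical, the negated forms swap \p for \P, and the negated form of \h is assumed to be \H.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row per POSIX class that is replaced in UCP mode. */
struct ucp_sub { const char *name; const char *pos; const char *neg; };

static const struct ucp_sub ucp_subs[] = {
    { "alnum", "\\p{Xan}", "\\P{Xan}" },
    { "alpha", "\\p{L}",   "\\P{L}"   },
    { "blank", "\\h",      "\\H"      },  /* \H for negation is an assumption */
    { "digit", "\\p{Nd}",  "\\P{Nd}"  },
    { "lower", "\\p{Ll}",  "\\P{Ll}"  },
    { "space", "\\p{Xps}", "\\P{Xps}" },
    { "upper", "\\p{Lu}",  "\\P{Lu}"  },
    { "word",  "\\p{Xwd}", "\\P{Xwd}" },
};

/* Returns the replacement sequence for a class name such as "alpha",
   or NULL if the class has no entry in the table above. */
static const char *ucp_substitute(const char *name, int negated)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof ucp_subs / sizeof ucp_subs[0]; i++) {
        if (strcmp(name, ucp_subs[i].name) == 0)
            return negated ? ucp_subs[i].neg : ucp_subs[i].pos;
    }
    return NULL;
}
```

For example, ucp_substitute("alpha", 0) returns the string \p{L}, and a class that is not in the table, such as "xdigit", returns NULL.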
Negated versions, such as [:^alpha:], use \P instead of \p. Three
other POSIX classes are handled specially in UCP mode:

  [:graph:] This matches characters that have glyphs that mark the
            page when printed. In Unicode property terms, it matches
            all characters with the L, M, N, P, S, or Cf properties,
            except for:

              U+061C           Arabic Letter Mark
              U+180E           Mongolian Vowel Separator
              U+2066 - U+2069  Various "isolate"s

  [:print:] This matches the same characters as [:graph:] plus space
            characters that are not controls, that is, characters with
            the Zs property.

  [:punct:] This matches all characters that have the Unicode P
            (punctuation) property, plus those characters whose code
            points are less than 128 that have the S (Symbol)
            property.

The other POSIX classes are unchanged, and match only characters
with code points less than 128.