Perl-compatible regular expressions (PCRE)
LOCALE SUPPORT
PCRE handles caseless matching, and determines whether characters
are letters, digits, or whatever, by reference to a set of
tables, indexed by character code point. When running in UTF-8
mode, or in the 16- or 32-bit libraries, this applies only to
characters with code points less than 256. By default, higher-
valued code points never match escapes such as \w or \d. However,
if PCRE is built with Unicode property support, all characters
can be tested with \p and \P, or, alternatively, the PCRE_UCP
option can be set when a pattern is compiled; this causes \w and
friends to use Unicode property support instead of the built-in
tables.
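
As a rough illustration of table-driven classification, the following C sketch builds a 256-entry table from the C library's <ctype.h> functions and consults it for a \w-style test. The flag names, the table layout, and the functions build_class_table() and is_word_char() are hypothetical and do not reflect PCRE's actual internal tables. The sketch only shows why code points of 256 or more cannot match when classification is done by table lookup alone.

```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical flag bits; PCRE's real table layout is different. */
enum { CT_LETTER = 0x01, CT_DIGIT = 0x02, CT_WORD = 0x04 };

static unsigned char class_table[256];

/* Fill the table from the C library's view of the current locale. */
void build_class_table(void)
{
    for (int c = 0; c < 256; c++) {
        unsigned char bits = 0;
        if (isalpha(c)) bits |= CT_LETTER;
        if (isdigit(c)) bits |= CT_DIGIT;
        if (isalnum(c) || c == '_') bits |= CT_WORD;
        class_table[c] = bits;
    }
}

/* \w-style test: the table has no entry for code points of 256 or more,
   so such characters never match. */
int is_word_char(uint32_t code_point)
{
    if (code_point >= 256)
        return 0;
    return (class_table[code_point] & CT_WORD) != 0;
}
```

In this sketch a Cyrillic letter such as U+0431 is rejected regardless of the locale, which mirrors the default behaviour described above for \w and \d.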
The use of locales with Unicode is discouraged. If you are
handling characters with code points greater than 127, you should
either use Unicode support, or use locales, but not try to mix
the two.
PCRE contains an internal set of tables that are used when the
final argument of pcre_compile() is NULL. These are sufficient
for many applications. Normally, the internal tables recognize
only ASCII characters. However, when PCRE is built, it is
possible to cause the internal tables to be rebuilt in the
default "C" locale of the local system, which may cause them to
be different.
The internal tables can always be overridden by tables supplied
by the application that calls PCRE. These may be created in a
different locale from the default. As more and more applications
change to using Unicode, the need for this locale support is
expected to die away.
External tables are built by calling the pcre_maketables()
function, which has no arguments, in the relevant locale. The
result can then be passed to pcre_compile() as often as
necessary. For example, to build and use tables that are
appropriate for the French locale (where accented characters with
values greater than 128 are treated as letters), the following
code could be used:
  setlocale(LC_CTYPE, "fr_FR");
  tables = pcre_maketables();
  re = pcre_compile(..., tables);
The locale name "fr_FR" is used on Linux and other Unix-like
systems; if you are using Windows, the name for the French locale
is "french".
When pcre_maketables() runs, the tables are built in memory that
is obtained via pcre_malloc. It is the caller's responsibility to
ensure that the memory containing the tables remains available
for as long as it is needed.
The pointer that is passed to pcre_compile() is saved with the
compiled pattern, and the same tables are used via this pointer
by pcre_study() and also by pcre_exec() and pcre_dfa_exec().
Thus, for any single pattern, compilation, studying and matching
all happen in the same locale, but different patterns can be
processed in different locales.
It is possible to pass a table pointer or NULL (indicating the
use of the internal tables) to pcre_exec() or pcre_dfa_exec()
(see the discussion below in the section on matching a pattern).
This facility is provided for use with pre-compiled patterns that
have been saved and reloaded. Character tables are not saved
with patterns, so if a non-standard table was used at compile
time, it must be provided again when the reloaded pattern is
matched. Attempting to use this facility to match a pattern in a
different locale from the one in which it was compiled is likely
to lead to anomalous (usually incorrect) results.