Regular expression matching
Rationale
The regexec() function must fill in all nmatch elements of
pmatch, where nmatch and pmatch are supplied by the application,
even if some elements of pmatch do not correspond to
subexpressions in pattern. The application developer should note
that there is probably no reason for using a value of nmatch that
is larger than preg->re_nsub+1.
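
As a minimal sketch of the sizing advice above, an application can allocate exactly re_nsub+1 elements for pmatch after compiling the pattern. The helper name count_match_slots() is hypothetical and is not part of <regex.h>; REG_EXTENDED is chosen here only for illustration.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns the number of pmatch slots (re_nsub + 1)
 * filled for the first match of `pattern` in `text`, 0 if there is no
 * match, or -1 if compilation or allocation fails. */
int count_match_slots(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* One slot for the whole match plus one per subexpression. */
    size_t nmatch = re.re_nsub + 1;
    regmatch_t *pmatch = malloc(nmatch * sizeof *pmatch);
    if (pmatch == NULL) {
        regfree(&re);
        return -1;
    }

    int result = 0;
    if (regexec(&re, text, nmatch, pmatch, 0) == 0)
        result = (int)nmatch;

    free(pmatch);
    regfree(&re);
    return result;
}
```

A larger array would also work, but the extra elements would never describe a subexpression of the pattern.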
The REG_NEWLINE flag supports a use of RE matching that is needed
in some applications like text editors. In such applications, the
user supplies an RE asking the application to find a line that
matches the given expression. An anchor in such an RE anchors at
the beginning or end of any line. Such an application can pass a
sequence of <newline>-separated lines to regexec() as a single
long string and specify REG_NEWLINE to regcomp() to get the
desired behavior. The application must ensure that there are no
explicit <newline> characters in pattern if it wants to ensure
that any match occurs entirely within a single line.
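
The editor-style use described above might look like the following sketch. The helper find_in_lines() is a hypothetical name; it assumes the whole buffer is passed as one string and that the pattern contains no explicit <newline> characters.

```c
#include <regex.h>

/* Hypothetical helper: returns the byte offset of the first match of
 * `pattern` in `buffer`, with anchors applying at every line boundary,
 * or -1 if nothing matches or the pattern does not compile. */
long find_in_lines(const char *pattern, const char *buffer)
{
    regex_t re;
    regmatch_t m[1];
    long offset = -1;

    /* REG_NEWLINE is given to regcomp(), not to regexec(). */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NEWLINE) != 0)
        return -1;

    if (regexec(&re, buffer, 1, m, 0) == 0)
        offset = (long)m[0].rm_so;

    regfree(&re);
    return offset;
}
```

For the buffer "alpha\nbeta\ngamma", the pattern "^beta$" matches at offset 6, the start of the second line.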
The REG_NEWLINE flag affects the behavior of regexec(), but it is
in the cflags parameter to regcomp() to allow flexibility of
implementation. Some implementations will want to generate the
same compiled RE in regcomp() regardless of the setting of
REG_NEWLINE and have regexec() handle anchors differently based
on the setting of the flag. Other implementations will generate
different compiled REs based on the setting of REG_NEWLINE.
The REG_ICASE flag supports the operations performed by the
grep -i option and the historical implementations of ex and vi.
Including this flag makes it easier to write application code
that behaves the same way as these utilities.
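
A rough equivalent of a grep -i style test could be sketched as follows. The helper name matches_ignoring_case() is hypothetical, and the pattern is compiled as a basic RE because REG_EXTENDED is not set.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches anywhere in
 * `text` without regard to case, 0 if it does not, and -1 if the
 * pattern does not compile. */
int matches_ignoring_case(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_ICASE | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```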
The substrings reported in pmatch[] are defined using offsets
from the start of the string rather than pointers. This allows
type-safe access to both constant and non-constant strings.
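
Because the results are offsets, the subject string can stay const-qualified while the application extracts the match. The following is a sketch under that assumption; copy_first_match() is a hypothetical helper, not part of the interface.

```c
#include <regex.h>
#include <string.h>

/* Hypothetical helper: copies the first match of `pattern` in `text`
 * into `out` (truncating to fit `outsz` bytes) and returns the number
 * of bytes copied, or -1 on no match, compile error, or outsz == 0. */
int copy_first_match(const char *pattern, const char *text,
                     char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[1];
    int len = -1;

    if (outsz == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, text, 1, m, 0) == 0) {
        /* rm_so and rm_eo are offsets from the start of text. */
        size_t n = (size_t)(m[0].rm_eo - m[0].rm_so);
        if (n >= outsz)
            n = outsz - 1;
        memcpy(out, text + m[0].rm_so, n);
        out[n] = '\0';
        len = (int)n;
    }

    regfree(&re);
    return len;
}
```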
The type regoff_t is used for the elements of pmatch[] to ensure
that the application can represent large arrays in memory
(important for an application conforming to the Shell and
Utilities volume of POSIX.1‐2017).
The 1992 edition of this standard required regoff_t to be at
least as wide as off_t, to facilitate future extensions in which
the string to be searched is taken from a file. However, these
future extensions have not appeared. The requirement rules out
popular implementations with 32-bit regoff_t and 64-bit off_t, so
it has been removed.
The standard developers rejected the inclusion of a regsub()
function that would be used to do substitutions for a matched RE.
While such a routine would be useful to some applications, its
utility would be much more limited than the matching function
described here. Both RE parsing and substitution are possible to
implement without support other than that required by the ISO C
standard, but matching is much more complex than substituting.
The only difficult part of substitution, given the information
supplied by regexec(), is finding the next character in a string
when there can be multi-byte characters. That is a much larger
issue, and one that needs a more general solution.
The errno variable has not been used for error returns to avoid
filling the errno name space for this feature.
The interface is defined so that the matched substring offsets
rm_so and rm_eo are in a separate regmatch_t structure instead of
in regex_t. This allows a single compiled RE to be used
simultaneously in several contexts; in main() and a signal
handler, perhaps, or in multiple threads of lightweight
processes. (The preg argument to regexec() is declared as a
pointer to const, so the implementation is not permitted to use
the structure to store intermediate results.) It also allows an
application to request an arbitrary number of substrings from an
RE. The number of subexpressions in the RE is reported in re_nsub
in preg. With this change to regexec(), consideration was given
to dropping the REG_NOSUB flag since the user can now specify
this with a zero nmatch argument to regexec(). However, keeping
REG_NOSUB allows an implementation to use a different (perhaps
more efficient) algorithm if it knows in regcomp() that no
subexpressions need be reported. The implementation is only
required to fill in pmatch if nmatch is not zero and if REG_NOSUB
is not specified. Note that the size_t type, as defined in the
ISO C standard, is unsigned, so the description of regexec() does
not need to address negative values of nmatch.
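
When only a yes/no answer is wanted, both mechanisms mentioned above can be combined, as in this sketch. The helper name matches() is hypothetical; passing a null pmatch relies on nmatch being zero.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches `text`, 0 if it
 * does not, and -1 if the pattern does not compile. */
int matches(const char *pattern, const char *text)
{
    regex_t re;

    /* REG_NOSUB tells regcomp() that no substrings will be requested,
     * which may let the implementation pick a cheaper algorithm. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* nmatch is 0, so pmatch is ignored and may be a null pointer. */
    int rc = regexec(&re, text, 0, NULL, 0);

    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```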
REG_NOTBOL was added to allow an application to do repeated
searches for the same pattern in a line. If the pattern contains
a <circumflex> character that should match the beginning of a
line, then the pattern should only match when matched against the
beginning of the line. Without the REG_NOTBOL flag, the
application could rewrite the expression for subsequent matches,
but in the general case this would require parsing the
expression. The need for REG_NOTEOL is not as clear; it was added
for symmetry.
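
The repeated-search case can be sketched as a loop that resumes after each match and sets REG_NOTBOL from the second call onward. The helper count_matches() is a hypothetical name, and stepping one byte past an empty match is a simplification that ignores multi-byte characters.

```c
#include <regex.h>

/* Hypothetical helper: counts successive non-overlapping matches of
 * `pattern` in `line`, or returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *line)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = line;
    int eflags = 0;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, m, eflags) == 0) {
        count++;
        if (m[0].rm_eo == 0) {
            /* Empty match at the current position: step one byte
             * forward so the loop always makes progress. */
            if (*p == '\0')
                break;
            p++;
        } else {
            p += m[0].rm_eo;
        }
        /* p no longer points at the true beginning of the line. */
        eflags = REG_NOTBOL;
    }

    regfree(&re);
    return count;
}
```

With this loop, the pattern "^ab" is counted once in "ababab", whereas without REG_NOTBOL each resumed search would treat its starting point as the beginning of a line.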
The addition of the regerror() function addresses the historical
need for conforming application programs to have access to error
information more specific than ``Function failed to compile/match
your RE for unknown reasons''.
This interface provides for two different methods of dealing with
error conditions. The specific error codes (REG_EBRACE, for
example), defined in <regex.h>, allow an application to recover
from an error if it is so able. Many applications, especially
those that use patterns supplied by a user, will not try to deal
with specific error cases, but will just use regerror() to obtain
a human-readable error message to present to the user.
The regerror() function uses a scheme similar to confstr() to
deal with the problem of allocating memory to hold the generated
string. The scheme used by strerror() in the ISO C standard was
considered unacceptable since it creates difficulties for multi-
threaded applications.
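
The confstr()-like scheme means the caller owns the buffer and can first ask how large it needs to be. A minimal sketch, with regex_error_message() as a hypothetical helper name:

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc()ed message describing
 * `errcode`, or a null pointer if allocation fails. The caller is
 * responsible for calling free() on the result. */
char *regex_error_message(int errcode, const regex_t *preg)
{
    /* With a buffer size of 0, regerror() ignores the buffer and
     * returns the size needed, including the terminating null byte. */
    size_t needed = regerror(errcode, preg, NULL, 0);

    char *buf = malloc(needed);
    if (buf != NULL)
        regerror(errcode, preg, buf, needed);
    return buf;
}
```

Because no static storage is shared between calls, each thread can format its own message independently.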
The preg argument is provided to regerror() to allow an
implementation to generate a more descriptive message than would
be possible with errcode alone. An implementation might, for
example, save the character offset of the offending character of
the pattern in a field of preg, and then include that in the
generated message string. The implementation may also ignore
preg.
A REG_FILENAME flag was considered, but omitted. This flag would
have caused regexec() to match patterns as described in the Shell
and Utilities volume of POSIX.1‐2017, Section 2.13, Pattern
Matching Notation instead of REs. This service is now provided by
the fnmatch() function.
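
For comparison, shell-style pattern matching through fnmatch() might be wrapped as in this sketch. The wrapper name name_matches() is hypothetical, and no flags are passed.

```c
#include <fnmatch.h>

/* Hypothetical helper: returns 1 if `name` matches the shell pattern
 * `glob`, and 0 otherwise (including on error). */
int name_matches(const char *glob, const char *name)
{
    return fnmatch(glob, name, 0) == 0;
}
```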
Notice that there is a difference in philosophy between the
ISO POSIX‐2:1993 standard and POSIX.1‐2008 in how to handle a
``bad'' regular expression. The ISO POSIX‐2:1993 standard says
that many bad constructs ``produce undefined results'', or that
``the interpretation is undefined''. POSIX.1‐2008, however, says
that the interpretation of such REs is unspecified. The term
``undefined'' means that the action by the application is an
error, of similar severity to passing a bad pointer to a
function.
The regcomp() and regexec() functions are required to accept any
null-terminated string as the pattern argument. If the meaning of
the string is ``undefined'', the behavior of the function is
``unspecified''. POSIX.1‐2008 does not specify how the functions
will interpret the pattern; they might return error codes, or
they might do pattern matching in some completely unexpected way,
but they should not do something like abort the process.