Even though the -c option and references to the C language are
retained in this description, lex may be generalized to other
languages, as was done at one time for EFL, the Extended FORTRAN
Language. Since the lex input specification is essentially
language-independent, versions of this utility could be written
to produce Ada, Modula-2, or Pascal code, and there are known
historical implementations that do so.
The current description of lex bypasses the issue of dealing with
internationalized EREs in the lex source code or generated
lexical analyzer. If it follows the model used by awk (the source
code is assumed to be presented in the POSIX locale, but input
and output are in the locale specified by the environment
variables), then the tables in the lexical analyzer produced by
lex would interpret EREs specified in the lex source in terms of
the environment variables specified when lex was executed. The
desired effect would be to have the lexical analyzer interpret
the EREs given in the lex source according to the environment
specified when the lexical analyzer is executed, but this is not
possible with the current lex technology.
The description of octal and hexadecimal-digit escape sequences
agrees with the ISO C standard use of escape sequences.
Earlier versions of this standard allowed for implementations
with bytes other than eight bits, but this has been modified in
this version.
There is no detailed output format specification. The observed
behavior of lex under four different historical implementations
was that none of these implementations consistently reported the
line numbers for error and warning messages. Furthermore, there
was a desire that lex be allowed to output additional diagnostic
messages. Leaving message formats unspecified avoids these
formatting questions and problems with internationalization.
Although the %x specifier for exclusive start conditions is not
historical practice, it is believed to be a minor change to
historical implementations and greatly enhances the usability of
lex programs since it permits an application to obtain the
expected functionality with fewer statements.
The %array and %pointer declarations were added as a compromise
between historical systems. The System V-based lex copies the
matched text to a yytext array. The flex program, supported in
BSD and GNU systems, uses a pointer. In the latter case,
significant performance improvements are available for some
scanners. Most historical programs should require no change in
porting from one system to another because the string being
referenced is null-terminated in both cases. (The method used by
flex is to null-terminate the token in place by
remembering the character that used to come right after the token
and restoring it before continuing on to the next scan.) Multi-file
programs with external references to yytext outside the
scanner source file should continue to operate on their
historical systems, but would require one of the new declarations
to be considered strictly portable.
The description of EREs avoids unnecessary duplication of ERE
details because their meanings within a lex ERE are the same as
those for EREs in this volume of POSIX.1‐2017.
The reason for the undefined condition associated with text
beginning with a <blank> or within "%{" and "%}" delimiter lines
appearing in the Rules section is historical practice. Both the
BSD and System V lex copy the indented (or enclosed) input in the
Rules section (except at the beginning) to unreachable areas of
the yylex() function (the code is written directly after a break
statement). In some cases, the System V lex generates an error
message or a syntax error, depending on the form of indented
input.
The intention in breaking the list of functions into those that
may appear in lex.yy.c versus those that only appear in libl.a is
that only those functions in libl.a can be reliably redefined by
a conforming application.
The descriptions of standard output and standard error are
somewhat complicated because historical lex implementations chose
to issue diagnostic messages to standard output (unless -t was
given). POSIX.1‐2008 allows this behavior, but leaves an opening
for the more expected behavior of using standard error for
diagnostics. Also, the System V behavior of writing the
statistics when any table sizes are given is allowed, while
BSD-derived systems can avoid it. The programmer can always precisely
obtain the desired results by using either the -t or -n options.
The OPERANDS section does not mention the use of - as a synonym
for standard input; not all historical implementations support
such usage for any of the file operands.
A description of the translation table was deleted from early
proposals because of its relatively low usage in historical
applications.
The change to the definition of the input() function that allows
buffering of input presents the opportunity for major performance
gains in some applications.
The following examples clarify the differences between lex
regular expressions and regular expressions appearing elsewhere
in this volume of POSIX.1‐2017. For regular expressions of the
form "r/x", the string matching r is always returned; confusion
may arise when the beginning of x matches the trailing portion of
r. For example, given the regular expression "a*b/cc" and the
input "aaabcc", yytext would contain the string "aaab" on this
match. But given the regular expression "x*/xy" and the input
"xxxy", the token xxx, not xx, is returned by some
implementations because xxx matches "x*".
In the rule "ab*/bc", the "b*" at the end of r extends r's match
into the beginning of the trailing context, so the result is
unspecified. If this rule were "ab/bc", however, the rule matches
the text "ab" when it is followed by the text "bc". In this
latter case, the matching of r cannot extend into the beginning
of x, so the result is specified.