yet another compiler compiler (DEVELOPMENT)
Rationale
The references in Referenced Documents
may be helpful in
constructing the parser generator. The referenced DeRemer and
Pennello article (along with the works it references) describes a
technique to generate parsers that conform to this volume of
POSIX.1‐2017. Work in this area continues to be done, so
implementors should consult current literature before doing any
new implementations. The original Knuth article is the
theoretical basis for this kind of parser, but the tables it
generates are impractically large for reasonable grammars and
should not be used. The ``equivalent to'' wording is intentional
to assure that the best tables that are LALR(1) can be generated.
There has been confusion between the class of grammars, the
algorithms needed to generate parsers, and the algorithms needed
to parse the languages. They are all reasonably orthogonal. In
particular, a parser generator that accepts the full range of
LR(1) grammars need not generate a table any more complex than
one that accepts SLR(1) (a relatively weak class of LR grammars)
for a grammar that happens to be SLR(1). Such an implementation
need not recognize the case, either; table compression can yield
the SLR(1) table (or one even smaller than that) without
recognizing that the grammar is SLR(1). The speed of an LR(1)
parser for any class is dependent more upon the table
representation and compression (or the code generation if a
direct parser is generated) than upon the class of grammar that
the table generator handles.
The speed of the parser generator is somewhat dependent upon the
class of grammar it handles. However, the algorithms in the
original Knuth article for constructing LR parsers were judged by
their author to be impractically slow at that time. Although full LR is more
complex than LALR(1), as computer speeds and algorithms improve,
the difference (in terms of acceptable wall-clock execution time)
is becoming less significant.
Potential authors are cautioned that the DeRemer and Pennello
article cited previously identifies a bug (an over-simplification
of the computation of LALR(1) lookahead sets) in some of the
LALR(1) algorithm statements that preceded it in publication. They
should take the time to seek out that paper, as well as current
relevant work, particularly Aho's.
The -b
option was added to provide a portable method for
permitting yacc to work on multiple separate parsers in the same
directory. If a directory contains more than one yacc grammar,
and both grammars are constructed at the same time (by, for
example, a parallel make program), conflict results. While the
solution is not historical practice, it corrects a known
deficiency in historical implementations. Corresponding changes
were made to all sections that referenced the filenames y.tab.c
(now ``the code file''), y.tab.h
(now ``the header file''), and
y.output
(now ``the description file'').
The grammar for yacc input is based on System V documentation.
The textual description there shows that the ';' is required at
the end of the rule. The grammar and the implementation do not
require this. (The use of C_IDENTIFIER
causes a reduce to occur
in the right place.)
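For example, in the following hypothetical fragment the first rule has no terminating ';', yet the C_IDENTIFIER that begins the next rule still causes it to be reduced correctly:

    %token NUMBER
    %%
    list  : list expr
          | expr           /* no terminating ';': still accepted */

    expr  : NUMBER ;       /* explicit ';' as described in the System V text */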
Also, in that implementation, the constructs such as %token
can
be terminated by a <semicolon>, but this is not permitted by the
grammar. The keywords such as %token
can also appear in
uppercase, which is again not discussed. In most places where '%'
is used, <backslash> can be substituted, and there are alternate
spellings for some of the symbols (for example, %LEFT
can be "%<"
or even "\<"
).
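A sketch of these historical variants follows (the token names are hypothetical, and a strictly conforming grammar should use only the lowercase '%' spellings):

    %token   NUMBER ;      /* trailing <semicolon>: tolerated historically */
    %TOKEN   STRING        /* uppercase keyword: tolerated historically    */
    %LEFT    '+' '-'       /* uppercase form of %left                      */
    %<       '*' '/'       /* "%<" as an alternate spelling of %LEFT       */
    \<       UMINUS        /* <backslash> substituted for '%'              */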
Historically, <tag> can contain any characters except '>'
,
including white space, in the implementation. However, since the
tag must reference an ISO C standard union member, in practice
conforming implementations need to support only the set of
characters for ISO C standard identifiers in this context.
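For example (the union members and token names below are hypothetical), each <tag> names a member of the %union declaration and is therefore written with ISO C identifier characters only:

    %union {
        long    ival;      /* value of integer-valued tokens  */
        double  dval;      /* value of floating-valued tokens */
    }
    %token <ival> INTEGER
    %token <dval> REAL
    %type  <dval> expr
    %%
    expr  : REAL           { $$ = $1; }
          | INTEGER        { $$ = $1; }
          ;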
Some historical implementations are known to accept actions that
are terminated by a period. Historical implementations often
allow '$'
in names. A conforming implementation does not need to
support either of these behaviors.
Deciding when to use %prec
illustrates the difficulty in
specifying the behavior of yacc. There may be situations in
which the grammar is not, strictly speaking, in error, and yet
yacc cannot interpret it unambiguously. Ambiguities in the
grammar can in many instances be resolved by providing additional
information, such as %type or %union declarations. When it is
appropriate, taking this alternative is often easier and usually
yields a smaller parser.
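The classic use of %prec is unary minus. In the hypothetical sketch below, the '-' expr alternative would otherwise take the lower precedence of the binary '-' token; %prec UMINUS gives it the precedence declared for UMINUS instead:

    %token NUMBER
    %left  '+' '-'
    %left  '*' '/'
    %right UMINUS
    %%
    expr  : expr '+' expr
          | expr '-' expr
          | expr '*' expr
          | expr '/' expr
          | '-' expr  %prec UMINUS
          | NUMBER
          ;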
In historical implementations, a program produced without the
runtime debugging code is usually smaller and slightly faster.
Statistics messages from several historical implementations
include the following types of information:
    n/512 terminals, n/300 non-terminals
    n/600 grammar rules, n/1500 states
    n shift/reduce, n reduce/reduce conflicts reported
    n/350 working sets used
    Memory: states, etc. n/15000, parser n/15000
    n/600 distinct lookahead sets
    n extra closures
    n shift entries, n exceptions
    n goto entries
    n entries saved by goto default
    Optimizer space used: input n/15000, output n/15000
    n table entries, n zero
    Maximum spread: n, Maximum offset: n
The report of internal tables in the description file is left
implementation-defined because all aspects of these limits are
also implementation-defined. Some implementations may use dynamic
allocation techniques and have no specific limit values to
report.
The format of the y.output
file is not given because
specification of the format was not seen to enhance applications
portability. The listing is primarily intended to help human
users understand and debug the parser; use of y.output
by a
conforming application script would be unusual. Furthermore,
implementations have not produced consistent output and no
popular format was apparent. The format selected by the
implementation should be human-readable, in addition to the
requirement that it be a text file.
Standard error reports are not specifically described because
they are seldom of use to conforming applications and there was
no reason to restrict implementations.
Some implementations recognize "={"
as equivalent to '{'
because
it appears in historical documentation. This construction was
recognized and documented as obsolete as long ago as 1978, in the
referenced Yacc: Yet Another Compiler-Compiler. This volume of
POSIX.1‐2017 chose to leave it as obsolete and omit it.
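As an illustration (the token WORD and the function process() are hypothetical), the first rule below uses the obsolete "={" construction that some implementations still accept; the second shows the standard form:

    %token WORD
    %%
    old_form : WORD ={ process($1); } ;   /* obsolete: '=' before the action */
    new_form : WORD  { process($1); } ;   /* standard form                   */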
Multi-byte characters should be recognized by the lexical
analyzer and returned as tokens. They should not be returned as
multi-byte character literals. The token error
that is used for
error recovery is normally assigned the value 256 in the
historical implementation. Thus, the token value 256, which is
used in many multi-byte character sets, is not available for use
as the value of a user-defined token.
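As a sketch (the token WORD and the value 257 are hypothetical choices), a grammar that assigns explicit token numbers therefore avoids 256, and the error token can be used for recovery in the usual way:

    %token WORD 257        /* explicit number: avoid 256, taken by error */
    %%
    stmts : /* empty */
          | stmts stmt
          ;
    stmt  : WORD ';'
          | error ';'      /* discard input up to the next ';' and resume */
          ;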