   yacc.1p    ( 1 )

yacc — yet another compiler compiler (DEVELOPMENT)

Rationale

The references in Referenced Documents may be helpful in constructing the parser generator. The referenced DeRemer and Pennello article (along with the works it references) describes a technique to generate parsers that conform to this volume of POSIX.1‐2017. Work in this area continues to be done, so implementors should consult current literature before doing any new implementations. The original Knuth article is the theoretical basis for this kind of parser, but the tables it generates are impractically large for reasonable grammars and should not be used. The ``equivalent to'' wording is intentional to assure that the best tables that are LALR(1) can be generated.

There has been confusion between the class of grammars, the algorithms needed to generate parsers, and the algorithms needed to parse the languages. They are all reasonably orthogonal. In particular, a parser generator that accepts the full range of LR(1) grammars need not generate a table any more complex than one that accepts SLR(1) (a relatively weak class of LR grammars) for a grammar that happens to be SLR(1). Such an implementation need not recognize the case, either; table compression can yield the SLR(1) table (or one even smaller than that) without recognizing that the grammar is SLR(1). The speed of an LR(1) parser for any class is dependent more upon the table representation and compression (or the code generation if a direct parser is generated) than upon the class of grammar that the table generator handles.

The speed of the parser generator is somewhat dependent upon the class of grammar it handles. However, the algorithms in the original Knuth article for constructing LR parsers were judged by their author to be impractically slow at that time. Although full LR is more complex than LALR(1), as computer speeds and algorithms improve, the difference (in terms of acceptable wall-clock execution time) is becoming less significant.

Potential authors are cautioned that the DeRemer and Pennello article previously cited identifies a bug (an over-simplification of the computation of LALR(1) lookahead sets) in some of the LALR(1) algorithm statements that preceded it in publication. They should take the time to seek out that paper, as well as current relevant work, particularly Aho's.

The -b option was added to provide a portable method for permitting yacc to work on multiple separate parsers in the same directory. If a directory contains more than one yacc grammar, and both grammars are constructed at the same time (by, for example, a parallel make program), conflict results. While the solution is not historical practice, it corrects a known deficiency in historical implementations. Corresponding changes were made to all sections that referenced the filenames y.tab.c (now ``the code file''), y.tab.h (now ``the header file''), and y.output (now ``the description file'').
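For example (the file and prefix names here are purely illustrative), two grammars can be built in the same directory without their output files colliding:

    yacc -d -b parse1 parse1.y    # writes parse1.tab.c and parse1.tab.h
    yacc -d -b parse2 parse2.y    # writes parse2.tab.c and parse2.tab.h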

The grammar for yacc input is based on System V documentation. The textual description there shows that the ';' is required at the end of a rule. The grammar and the implementation do not require this. (The use of C_IDENTIFIER causes a reduce to occur in the right place.)
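As a minimal sketch (the rule and token names are illustrative), the first rule below omits the terminating ';' and is still accepted, because the C_IDENTIFIER that begins the next rule ends it:

    %token NUMBER
    %%
    list  : expr              /* no terminating ';' needed here ...       */
    expr  : NUMBER            /* ... because this C_IDENTIFIER ends it    */
          ;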

Also, in that implementation, constructs such as %token can be terminated by a <semicolon>, but this is not permitted by the grammar. The keywords such as %token can also appear in uppercase, which is again not discussed. In most places where '%' is used, <backslash> can be substituted, and there are alternate spellings for some of the symbols (for example, %LEFT can be "%<" or even "\<").
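For comparison, a sketch of the portable spellings (only these are required of conforming implementations; the historical alternates noted above need not be accepted):

    %token  NUMBER      /* portable; historically also %TOKEN or "%token NUMBER ;" */
    %left   '+' '-'     /* portable; historically also %LEFT, "%<", or "\<"        */
    %%
    expr  : expr '+' expr
          | expr '-' expr
          | NUMBER
          ;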

Historically, <tag> can contain any characters except '>', including white space, in the implementation. However, since the tag must reference an ISO C standard union member, in practice conforming implementations need to support only the set of characters for ISO C standard identifiers in this context.
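A minimal sketch (the member and token names are illustrative): because each <tag> names a member of the %union, in portable usage it is an ISO C identifier:

    %union {
        long  ival;           /* <ival> tags values stored in this member */
        char *sval;
    }
    %token <ival> NUMBER
    %token <sval> NAME
    %type  <ival> expr
    %%
    expr  : NUMBER
          ;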

Some historical implementations are known to accept actions that are terminated by a period. Historical implementations often allow '$' in names. A conforming implementation does not need to support either of these behaviors.

Deciding when to use %prec illustrates the difficulty in specifying the behavior of yacc. There may be situations in which the grammar is not, strictly speaking, in error, and yet yacc cannot interpret it unambiguously. Many such ambiguities can be resolved by providing additional information, such as %type or %union declarations. Taking this alternative when it is appropriate is often easier and usually yields a smaller parser.
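As a sketch of the most common case for %prec (the token and rule names are illustrative): unary minus is given a higher precedence than the binary operators that share the '-' terminal:

    %token NUMBER
    %left  '+' '-'
    %left  '*' '/'
    %right UMINUS                    /* declared only to carry a precedence */
    %%
    expr  : expr '+' expr
          | expr '-' expr
          | expr '*' expr
          | expr '/' expr
          | '-' expr  %prec UMINUS   /* use the precedence of UMINUS, not '-' */
          | NUMBER
          ;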

In historical implementations, a program produced without the runtime debugging code is usually smaller and slightly faster.

Statistics messages from several historical implementations include the following types of information:

    n/512 terminals, n/300 non-terminals
    n/600 grammar rules, n/1500 states
    n shift/reduce, n reduce/reduce conflicts reported
    n/350 working sets used
    Memory: states, etc. n/15000, parser n/15000
    n/600 distinct lookahead sets
    n extra closures
    n shift entries, n exceptions
    n goto entries
    n entries saved by goto default
    Optimizer space used: input n/15000, output n/15000
    n table entries, n zero
    Maximum spread: n, Maximum offset: n

The report of internal tables in the description file is left implementation-defined because all aspects of these limits are also implementation-defined. Some implementations may use dynamic allocation techniques and have no specific limit values to report.

The format of the y.output file is not given because specification of the format was not seen to enhance applications portability. The listing is primarily intended to help human users understand and debug the parser; use of y.output by a conforming application script would be unusual. Furthermore, implementations have not produced consistent output and no popular format was apparent. The format selected by the implementation should be human-readable, in addition to the requirement that it be a text file.

Standard error reports are not specifically described because they are seldom of use to conforming applications and there was no reason to restrict implementations.

Some implementations recognize "={" as equivalent to '{' because it appears in historical documentation. This construction was recognized and documented as obsolete as long ago as 1978, in the referenced Yacc: Yet Another Compiler-Compiler. This volume of POSIX.1‐2017 chose to leave it as obsolete and omit it.

Multi-byte characters should be recognized by the lexical analyzer and returned as tokens. They should not be returned as multi-byte character literals. The token error that is used for error recovery is normally assigned the value 256 in the historical implementation. Thus, the token value 256, which is used in many multi-byte character sets, is not available for use as the value of a user-defined token.
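A hedged sketch of this advice (the grammar, the token name WORD, and the word-oriented lexical analysis are illustrative only): the lexical analyzer groups input into words and always returns the named token WORD, so no raw character value, 256 or otherwise, is ever returned as a token:

    %{
    #include <stdio.h>
    void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
    int yylex(void);
    %}
    %token WORD               /* named token; yacc assigns its value      */
    %%
    input :                   /* empty */
          | input WORD
          ;
    %%
    int yylex(void)
    {
        int c;

        /* skip white space */
        while ((c = getchar()) == ' ' || c == '\t' || c == '\n')
            ;
        if (c == EOF)
            return 0;         /* end of input */

        /* consume the rest of the (possibly multi-byte) word; its byte
         * values are deliberately never used as token values            */
        while ((c = getchar()) != EOF && c != ' ' && c != '\t' && c != '\n')
            ;
        if (c != EOF)
            ungetc(c, stdin);

        return WORD;          /* a named token, never a raw character    */
    }

    int main(void) { return yyparse(); }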