awk.1p(1)

awk — pattern scanning and processing language

Rationale

This description is based on the new awk, ``nawk'' (see the referenced The AWK Programming Language), which introduced a number of new features to the historical awk:

1. New keywords: delete, do, function, return

2. New built-in functions: atan2, close, cos, gsub, match, rand, sin, srand, sub, system

3. New predefined variables: FNR, ARGC, ARGV, RSTART, RLENGTH, SUBSEP

4. New expression operators: ?, :, ,, ^

5. The FS variable and the third argument to split, now treated as extended regular expressions.

6. The operator precedence, changed to more closely match the C language. Two examples of code that operate differently are:

while ( n /= 10 > 1) ...
if (!"wk" ~ /bwk/) ...
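Under the C-like precedence, for instance, n /= 10 > 1 groups as n /= (10 > 1) and !"wk" ~ /bwk/ groups as (!"wk") ~ /bwk/. Explicit parentheses (a sketch, not part of the original rationale) make the presumably intended grouping unambiguous in any implementation:

while ( (n /= 10) > 1 ) ...
if ( !("wk" ~ /bwk/) ) ...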

Several features have been added based on newer implementations of awk:

* Multiple instances of -f progfile are permitted.

* The new option -v assignment.

* The new predefined variable ENVIRON.

* New built-in functions toupper and tolower.

* More formatting capabilities are added to printf to match the ISO C standard.
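The following minimal sketch (not part of the original rationale) exercises a few of these additions: the ENVIRON array, toupper, and ISO C-style printf formatting:

BEGIN {
    # ENVIRON gives read-only access to the environment
    home = ENVIRON["HOME"]
    # toupper/tolower perform case conversion
    printf("%-10s %s\n", toupper("home:"), home)
}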

Earlier versions of this standard required implementations to support multiple adjacent <semicolon>s, lines with one or more <semicolon> before a rule (pattern-action pairs), and lines with only <semicolon>(s). These are not required by this standard and are considered poor programming practice, but can be accepted by an implementation of awk as an extension.

The overall awk syntax has always been based on the C language, with a few features from the shell command language and other sources. Because of this, it is not completely compatible with any other language, which has caused confusion for some users. It is not the intent of the standard developers to address such issues. A few relatively minor changes toward making the language more compatible with the ISO C standard were made; most of these changes are based on similar changes in recent implementations, as described above. There remain several C-language conventions that are not in awk. One of the notable ones is the <comma> operator, which is commonly used to specify multiple expressions in the C language for statement. Also, there are various places where awk is more restrictive than the C language regarding the type of expression that can be used in a given context. These limitations are due to the different features that the awk language does provide.

Regular expressions in awk have been extended somewhat from historical implementations to make them a pure superset of extended regular expressions, as defined by POSIX.1‐2008 (see the Base Definitions volume of POSIX.1‐2017, Section 9.4, Extended Regular Expressions). The main extensions are internationalization features and interval expressions. Historical implementations of awk have long supported <backslash>-escape sequences as an extension to extended regular expressions, and this extension has been retained despite inconsistency with other utilities. The number of escape sequences recognized in both extended regular expressions and strings has varied (generally increasing with time) among implementations. The set specified by POSIX.1‐2008 includes most sequences known to be supported by popular implementations and by the ISO C standard. One sequence that is not supported is hexadecimal value escapes beginning with '\x'. This would allow values expressed in more than 9 bits to be used within awk as in the ISO C standard. However, because this syntax has a non-deterministic length, it does not permit the subsequent character to be a hexadecimal digit. This limitation can be dealt with in the C language by the use of lexical string concatenation. In the awk language, concatenation could also be a solution for strings, but not for extended regular expressions (either lexical ERE tokens or strings used dynamically as regular expressions). Because of this limitation, the feature has not been added to POSIX.1‐2008.

When a string variable is used in a context where an extended regular expression normally appears (where the lexical token ERE is used in the grammar), the string does not contain the literal <slash> characters.
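For example (a sketch, not part of the original text), a dynamic regular expression is written as an ordinary string, without the <slash> delimiters used by a lexical ERE token:

BEGIN { re = "^[0-9]+$" }                  # a string, with no <slash> delimiters
$1 ~ re           { print "matched via dynamic ERE" }
$1 ~ /^[0-9]+$/   { print "matched via lexical ERE token" }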

Some versions of awk allow the form:

func name(args, ... ) { statements }

This has been deprecated by the authors of the language, who asked that it not be specified.
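The standardized form spells out the function keyword in full; for example:

function max(a, b)
{
    return (a > b) ? a : b
}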

Historical implementations of awk produce an error if a next statement is executed in a BEGIN action, and cause awk to terminate if a next statement is executed in an END action. This behavior has not been documented, and it was not believed that it was necessary to standardize it.

The specification of conversions between string and numeric values is much more detailed than in the documentation of historical implementations or in the referenced The AWK Programming Language. Although most of the behavior is designed to be intuitive, the details are necessary to ensure compatible behavior from different implementations. This is especially important in relational expressions since the types of the operands determine whether a string or numeric comparison is performed. From the perspective of an application developer, it is usually sufficient to expect intuitive behavior and to force conversions (by adding zero or concatenating a null string) when the type of an expression does not obviously match what is needed. The intent has been to specify historical practice in almost all cases. The one exception is that, in historical implementations, variables and constants maintain both string and numeric values after their original value is converted by any use. This means that referencing a variable or constant can have unexpected side-effects. For example, with historical implementations the following program:

{ a = "+2" b = 2 if (NR % 2) c = a + b if (a == b) print "numeric comparison" else print "string comparison" }

would perform a numeric comparison (and output numeric comparison) for each odd-numbered line, but perform a string comparison (and output string comparison) for each even-numbered line. POSIX.1‐2008 ensures that comparisons will be numeric if necessary. With historical implementations, the following program:

BEGIN {
    OFMT = "%e"
    print 3.14
    OFMT = "%f"
    print 3.14
}

would output "3.140000e+00" twice, because in the second print statement the constant "3.14" would have a string value from the previous conversion. POSIX.1‐2008 requires that the output of the second print statement be "3.140000". The behavior of historical implementations was seen as too unintuitive and unpredictable.

It was pointed out that with the rules contained in early drafts, the following script would print nothing:

BEGIN {
    y[1.5] = 1
    OFMT = "%e"
    print y[1.5]
}

Therefore, a new variable, CONVFMT, was introduced. The OFMT variable is now restricted to affecting output conversions of numbers to strings and CONVFMT is used for internal conversions, such as comparisons or array indexing. The default value is the same as that for OFMT, so unless a program changes CONVFMT (which no historical program would do), it will receive the historical behavior associated with internal string conversions.
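A short sketch of the distinction (assuming the defaults are overridden): OFMT governs the output conversion performed by print, while CONVFMT governs internal conversions such as concatenation and array subscripting:

BEGIN {
    CONVFMT = "%.2g"          # internal number-to-string conversions
    OFMT = "%.6f"             # output conversions done by print
    x = 3.14159
    s = x ""                  # concatenation uses CONVFMT: s is "3.1"
    print s                   # prints the string "3.1"
    print x                   # print converts with OFMT: "3.141590"
}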

The POSIX awk lexical and syntactic conventions are specified more formally than in other sources. Again the intent has been to specify historical practice. One convention that may not be obvious from the formal grammar, as it is in other verbal descriptions, is where <newline> characters are acceptable. There are several obvious placements, such as terminating a statement, and a <backslash> can be used to escape <newline> characters between any lexical tokens. In addition, <newline> characters without <backslash> characters can follow a comma, an open brace, a logical AND operator ("&&"), a logical OR operator ("||"), the do keyword, the else keyword, and the closing parenthesis of an if, for, or while statement. For example:

{ print $1,
        $2 }
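Newlines are likewise accepted without a <backslash> after "&&", after else, and after the closing parenthesis of an if, as in this sketch:

NR > 1 &&
$1 == "total" {
    if ($2 > 100)
        print "large total"
    else
        print "small total"
}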

The requirement that awk add a trailing <newline> to the program argument text is to simplify the grammar, making it match a text file in form. There is no way for an application or test suite to determine whether a literal <newline> is added or whether awk simply acts as if it did.

POSIX.1‐2008 requires several changes from historical implementations in order to support internationalization. Probably the most subtle of these is the use of the decimal-point character, defined by the LC_NUMERIC category of the locale, in representations of floating-point numbers. This locale-specific character is used in recognizing numeric input, in converting between strings and numeric values, and in formatting output. However, regardless of locale, the <period> character (the decimal-point character of the POSIX locale) is the decimal-point character recognized in processing awk programs (including assignments in command line arguments). This is essentially the same convention as the one used in the ISO C standard. The difference is that the C language includes the setlocale() function, which permits an application to modify its locale. Because of this capability, a C application begins executing with its locale set to the C locale, and only executes in the environment-specified locale after an explicit call to setlocale(). However, adding such an elaborate new feature to the awk language was seen as inappropriate for POSIX.1‐2008. It is possible to execute an awk program explicitly in any desired locale by setting the environment in the shell.

The undefined behavior resulting from NULs in extended regular expressions allows future extensions for the GNU gawk program to process binary data.

The behavior in the case of invalid awk programs (including lexical, syntactic, and semantic errors) is undefined because it was considered overly limiting on implementations to specify. In most cases such errors can be expected to produce a diagnostic and a non-zero exit status. However, some implementations may choose to extend the language in ways that make use of certain invalid constructs. Other invalid constructs might be deemed worthy of a warning, but otherwise cause some reasonable behavior. Still other constructs may be very difficult to detect in some implementations. Also, different implementations might detect a given error during an initial parsing of the program (before reading any input files) while others might detect it when executing the program after reading some input. Implementors should be aware that diagnosing errors as early as possible and producing useful diagnostics can ease debugging of applications, and thus make an implementation more usable.

The unspecified behavior from using multi-character RS values is to allow possible future extensions based on extended regular expressions used for record separators. Historical implementations take the first character of the string and ignore the others.

Unspecified behavior when split(string,array,<null>) is used is to allow a proposed future extension that would split up a string into an array of individual characters.
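In the absence of such an extension, a string can be split into individual characters portably with substr(); a sketch (the array name chars is illustrative):

{
    n = length($0)
    for (i = 1; i <= n; i++)
        chars[i] = substr($0, i, 1)    # one character per array element
}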

In the context of the getline function, equally good arguments for different precedences of the | and < operators can be made. Historical practice has been that:

getline < "a" "b"

is parsed as:

( getline < "a" ) "b"

although many would argue that the intent was that the file ab should be read. However:

getline < "x" + 1

parses as:

getline < ( "x" + 1 )

Similar problems occur with the | version of getline, particularly in combination with $. For example:

$"echo hi" | getline

(This situation is particularly problematic when used in a print statement, where the |getline part might be a redirection of the print.)

Since in most cases such constructs are not (or at least should not be) used (because they have a natural ambiguity for which there is no conventional parsing), the meaning of these constructs has been made explicitly unspecified. (The effect is that a conforming application that runs into the problem must parenthesize to resolve the ambiguity.) There appeared to be few, if any, actual uses of such constructs.
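For example, an application that wants the concatenated filename "ab" can write the parentheses explicitly; a sketch:

BEGIN {
    while ((getline line < ("a" "b")) > 0)    # the parentheses force reading from file "ab"
        print line
}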

Grammars can be written that would cause an error under these circumstances. Where backwards-compatibility is not a large consideration, implementors may wish to use such grammars.

Some historical implementations have allowed some built-in functions to be called without an argument list, the result being a default argument list chosen in some ``reasonable'' way. Use of length as a synonym for length($0) is the only one of these forms that is thought to be widely known or widely used; this particular form is documented in various places (for example, most historical awk reference pages, although not in the referenced The AWK Programming Language) as legitimate practice. With this exception, default argument lists have always been undocumented and vaguely defined, and it is not at all clear how (or if) they should be generalized to user-defined functions. They add no useful functionality and preclude possible future extensions that might need to name functions without calling them. Not standardizing them seems the simplest course. The standard developers considered that length merited special treatment, however, since it has been documented in the past and sees possibly substantial use in historical programs. Accordingly, this usage has been made legitimate, but Issue 5 removed the obsolescent marking for XSI-conforming implementations and many otherwise conforming applications depend on this feature.
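For example, under this provision the following two rules are equivalent (a sketch):

length > 72       { print "long line: " NR }
length($0) > 72   { print "long line: " NR }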

In sub and gsub, if repl is a string literal (the lexical token STRING), then two consecutive <backslash> characters should be used in the string to ensure a single <backslash> will precede the <ampersand> when the resultant string is passed to the function. (For example, to specify one literal <ampersand> in the replacement string, use gsub(ERE, "\\&").)

Historically, the only special character in the repl argument of sub and gsub string functions was the <ampersand> ('&') character and preceding it with the <backslash> character was used to turn off its special meaning.
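A short sketch of this behavior (string literals are shown, so a <backslash> intended for sub must itself be written as "\\"):

BEGIN {
    s = "hello world"
    t = s; sub(/world/, "[&]", t); print t    # '&' stands for the matched text: "hello [world]"
    u = s; sub(/world/, "\\&", u); print u    # "\\&" passes \& to sub: prints "hello &"
}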

The description in the ISO POSIX‐2:1993 standard introduced behavior such that the <backslash> character was another special character and it was unspecified whether there were any other special characters. This description introduced several portability problems, some of which are described below, and so it has been replaced with the more historical description. Some of the problems include:

* Historically, to create the replacement string, a script could use gsub(ERE, "\\&"), but with the ISO POSIX‐2:1993 standard wording, it was necessary to use gsub(ERE, "\\\\&"). The <backslash> characters are doubled here because all string literals are subject to lexical analysis, which would reduce each pair of <backslash> characters to a single <backslash> before being passed to gsub.

* Since it was unspecified what the special characters were, for portable scripts to guarantee that characters are printed literally, each character had to be preceded with a <backslash>. (For example, a portable script had to use gsub(ERE, "\\h\\i") to produce a replacement string of "hi".)

The description for comparisons in the ISO POSIX‐2:1993 standard did not properly describe historical practice because of the way numeric strings are compared as numbers. The current rules cause the following code:

if (0 == "000") print "strange, but true" else print "not true"

to do a numeric comparison, causing the if to succeed. It should be intuitively obvious that this is incorrect behavior, and indeed, no historical implementation of awk actually behaves this way.

To fix this problem, the definition of numeric string was enhanced to include only those values obtained from specific circumstances (mostly external sources) where it is not possible to determine unambiguously whether the value is intended to be a string or a numeric.

Variables that are assigned to a numeric string shall also be treated as a numeric string. (For example, the notion of a numeric string can be propagated across assignments.) In comparisons, all variables having the uninitialized value are to be treated as a numeric operand evaluating to the numeric value zero.

Uninitialized variables include all types of variables including scalars, array elements, and fields. The definition of an uninitialized value in Variables and Special Variables is necessary to describe the value placed on uninitialized variables and on fields that are valid (for example, < $NF) but have no characters in them and to describe how these variables are to be used in comparisons. A valid field, such as $1, that has no characters in it can be obtained from an input line of "\t\t" when FS='\t'. Historically, the comparison ($1<10) was done numerically after evaluating $1 to the value zero.
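A sketch of the case just described, with FS set to a <tab>:

BEGIN { FS = "\t" }
{
    # On an input line of "\t\t", $1 is a valid but empty field;
    # it takes the uninitialized value and compares as numeric 0.
    if ($1 < 10)
        print "numeric comparison: empty $1 evaluates to 0"
}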

The phrase ``... also shall have the numeric value of the numeric string'' was removed from several sections of the ISO POSIX‐2:1993 standard because it specifies an unnecessary implementation detail. It is not necessary for POSIX.1‐2008 to specify that these objects be assigned two different values. It is only necessary to specify that these objects may evaluate to two different values depending on context.

Historical implementations of awk did not parse hexadecimal integer or floating constants like "0xa" and "0xap0". Due to an oversight, the 2001 through 2004 editions of this standard required support for hexadecimal floating constants. This was due to the reference to atof(). This version of the standard allows but does not require implementations to use atof() and includes a description of how floating-point numbers are recognized as an alternative to match historic behavior. The intent of this change is to allow implementations to recognize floating-point constants according to either the ISO/IEC 9899:1990 standard or ISO/IEC 9899:1999 standard, and to allow (but not require) implementations to recognize hexadecimal integer constants.

Historical implementations of awk did not support floating-point infinities and NaNs in numeric strings; e.g., "-INF" and "NaN". However, implementations that use the atof() or strtod() functions to do the conversion picked up support for these values if they used an ISO/IEC 9899:1999 standard version of the function instead of an ISO/IEC 9899:1990 standard version. Due to an oversight, the 2001 through 2004 editions of this standard did not allow support for infinities and NaNs, but in this revision support is allowed (but not required). This is a silent change to the behavior of awk programs; for example, in the POSIX locale the expression:

("-INF" + 0 < 0)

formerly had the value 0 because "-INF" converted to 0, but now it may have the value 0 or 1.