This description is based on the new awk, ``nawk'', (see the
referenced The AWK Programming Language), which introduced a
number of new features to the historical awk:
1. New keywords: delete, do, function, return
2. New built-in functions: atan2, close, cos, gsub, match, rand,
sin, srand, sub, system
3. New predefined variables: FNR, ARGC, ARGV, RSTART, RLENGTH,
SUBSEP
4. New expression operators: ?, :, ,, ^
5. The FS variable and the third argument to split, now treated
as extended regular expressions.
6. The operator precedence, changed to more closely match the C
language. Two examples of code that operate differently are:
while ( n /= 10 > 1) ...
if (!"wk" ~ /bwk/) ...
Several features have been added based on newer implementations
of awk:
* Multiple instances of -f progfile are permitted.
* The new option -v assignment.
* The new predefined variable ENVIRON.
* New built-in functions toupper and tolower.
* More formatting capabilities are added to printf to match the
ISO C standard.
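As a brief illustration of three of the additions listed above (the
-v option, ENVIRON, and toupper/tolower), the following sketch can be
run from a shell. The variable names name and GREETING are made up
for the example and are not part of the standard.

```sh
# -v assigns the variable before the program (including BEGIN) runs.
upper=$(awk -v name="world" 'BEGIN { print toupper(name) }')
echo "$upper"

# ENVIRON exposes the environment; GREETING is a hypothetical variable.
greeting=$(GREETING="Hello" awk 'BEGIN { print tolower(ENVIRON["GREETING"]) }')
echo "$greeting"
```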
Earlier versions of this standard required implementations to
support multiple adjacent <semicolon>s, lines with one or more
<semicolon> before a rule (pattern-action pairs), and lines with
only <semicolon>(s). These are not required by this standard and
are considered poor programming practice, but can be accepted by
an implementation of awk as an extension.
The overall awk syntax has always been based on the C language,
with a few features from the shell command language and other
sources. Because of this, it is not completely compatible with
any other language, which has caused confusion for some users. It
is not the intent of the standard developers to address such
issues. A few relatively minor changes toward making the language
more compatible with the ISO C standard were made; most of these
changes are based on similar changes in recent implementations,
as described above. There remain several C-language conventions
that are not in awk. One of the notable ones is the <comma>
operator, which is commonly used to specify multiple expressions
in the C language for statement. Also, there are various places
where awk is more restrictive than the C language regarding the
type of expression that can be used in a given context. These
limitations are due to the different features that the awk
language does provide.
Regular expressions in awk have been extended somewhat from
historical implementations to make them a pure superset of
extended regular expressions, as defined by POSIX.1‐2008 (see the
Base Definitions volume of POSIX.1‐2017, Section 9.4, Extended
Regular Expressions). The main extensions are
internationalization features and interval expressions.
Historical implementations of awk have long supported
<backslash>-escape sequences as an extension to extended regular
expressions, and this extension has been retained despite
inconsistency with other utilities. The number of escape
sequences recognized in both extended regular expressions and
strings has varied (generally increasing with time) among
implementations. The set specified by POSIX.1‐2008 includes most
sequences known to be supported by popular implementations and by
the ISO C standard. One sequence that is not supported is
hexadecimal value escapes beginning with '\x'. This would allow
values expressed in more than 9 bits to be used within awk as in
the ISO C standard. However, because this syntax has a
non-deterministic length, it does not permit the subsequent
character to be a hexadecimal digit. This limitation can be dealt
with in the C language by the use of lexical string concatenation.
In the
awk language, concatenation could also be a solution for strings,
but not for extended regular expressions (either lexical ERE
tokens or strings used dynamically as regular expressions).
Because of this limitation, the feature has not been added to
POSIX.1‐2008.
When a string variable is used in a context where an extended
regular expression normally appears (where the lexical token ERE
is used in the grammar) the string does not contain the literal
<slash> characters.
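A minimal sketch of the distinction, assuming nothing beyond the
behavior described above: the same expression is written once as a
string used dynamically (no slashes) and once as a lexical ERE token
(slashes required). The variable name re is illustrative only.

```sh
dyn_match=$(awk 'BEGIN {
    re = "^a+b$"                                   # dynamic ERE held in a string: no slashes
    if ("aaab" ~ re)      print "dynamic match"
    if ("aaab" ~ /^a+b$/) print "literal match"    # ERE token: slashes delimit it
}')
echo "$dyn_match"
```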
Some versions of awk allow the form:
func name(args, ... ) { statements }
This has been deprecated by the authors of the language, who
asked that it not be specified.
Historical implementations of awk produce an error if a next
statement is executed in a BEGIN action, and cause awk to
terminate if a next statement is executed in an END action. This
behavior has not been documented, and it was not believed that it
was necessary to standardize it.
The specification of conversions between string and numeric
values is much more detailed than in the documentation of
historical implementations or in the referenced The AWK
Programming Language. Although most of the behavior is designed
to be intuitive, the details are necessary to ensure compatible
behavior from different implementations. This is especially
important in relational expressions since the types of the
operands determine whether a string or numeric comparison is
performed. From the perspective of an application developer, it
is usually sufficient to expect intuitive behavior and to force
conversions (by adding zero or concatenating a null string) when
the type of an expression does not obviously match what is
needed. The intent has been to specify historical practice in
almost all cases. The one exception is that, in historical
implementations, variables and constants maintain both string and
numeric values after their original value is converted by any
use. This means that referencing a variable or constant can have
unexpected side-effects. For example, with historical
implementations the following program:
{
    a = "+2"
    b = 2
    if (NR % 2)
        c = a + b
    if (a == b)
        print "numeric comparison"
    else
        print "string comparison"
}
would perform a numeric comparison (and output numeric
comparison) for each odd-numbered line, but perform a string
comparison (and output string comparison) for each even-numbered
line. POSIX.1‐2008 ensures that comparisons will be numeric if
necessary. With historical implementations, the following
program:
BEGIN {
    OFMT = "%e"
    print 3.14
    OFMT = "%f"
    print 3.14
}
would output "3.140000e+00" twice, because in the second print
statement the constant "3.14" would have a string value from the
previous conversion. POSIX.1‐2008 requires that the output of the
second print statement be "3.140000". The behavior of historical
implementations was seen as too unintuitive and unpredictable.
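The advice to force conversions when the type of an expression is
not obvious can be sketched as follows. This is an illustration
only; the variable names are hypothetical. Both operands start out
as string constants, so the first comparison is a string comparison
unless a numeric conversion is forced.

```sh
conv=$(awk 'BEGIN {
    a = "10"; b = "9"                  # string constants
    as_strings = (a < b)               # string comparison: "10" sorts before "9", so 1
    as_numbers = (a + 0 < b + 0)       # adding zero forces a numeric comparison: 0
    forced_str = ((10 "") < (9 ""))    # concatenating "" forces a string comparison: 1
    print as_strings, as_numbers, forced_str
}')
echo "$conv"
```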
It was pointed out that with the rules contained in early drafts,
the following script would print nothing:
BEGIN {
    y[1.5] = 1
    OFMT = "%e"
    print y[1.5]
}
Therefore, a new variable, CONVFMT, was introduced. The OFMT
variable is now restricted to affecting output conversions of
numbers to strings and CONVFMT is used for internal conversions,
such as comparisons or array indexing. The default value is the
same as that for OFMT, so unless a program changes CONVFMT (which
no historical program would do), it will receive the historical
behavior associated with internal string conversions.
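The division of labor between OFMT and CONVFMT can be observed with
a short sketch such as the one below. The format strings and the
value 3.14159 are arbitrary choices for the example. The last three
lines repeat the array script from above to show that changing OFMT
no longer affects the subscript conversion.

```sh
fmt=$(awk 'BEGIN {
    OFMT = "%.2f"; CONVFMT = "%.3f"
    x = 3.14159
    print x          # output conversion: uses OFMT
    s = x ""         # internal number-to-string conversion: uses CONVFMT
    print s
    y[1.5] = 1
    OFMT = "%e"      # does not change how the subscript 1.5 is converted
    print y[1.5]     # element is found; an integral value prints as an integer
}')
echo "$fmt"
```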
The POSIX awk lexical and syntactic conventions are specified
more formally than in other sources. Again the intent has been to
specify historical practice. One convention that may not be
obvious from the formal grammar as in other verbal descriptions
is where <newline> characters are acceptable. There are several
obvious placements such as terminating a statement, and a
<backslash> can be used to escape <newline> characters between
any lexical tokens. In addition, <newline> characters without
<backslash> characters can follow a comma, an open brace, a
logical AND operator ("&&"), a logical OR operator ("||"), the do
keyword, the else keyword, and the closing parenthesis of an if,
for, or while statement. For example:
{ print $1,
        $2 }
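A slightly larger sketch, with made-up input, exercises several of
the permitted positions at once: a <newline> after "&&", after an
open brace, after the closing parenthesis of an if, after else, and
after a comma.

```sh
nl=$(printf 'a b\n' | awk '
$1 == "a" &&
$2 == "b" {
    if ($1 == "z")
        print "unexpected"
    else
        print $2,
              $1
}')
echo "$nl"
```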
The requirement that awk add a trailing <newline> to the program
argument text is to simplify the grammar, making it match a text
file in form. There is no way for an application or test suite to
determine whether a literal <newline> is added or whether awk
simply acts as if it did.
POSIX.1‐2008 requires several changes from historical
implementations in order to support internationalization.
Probably the most subtle of these is the use of the decimal-point
character, defined by the LC_NUMERIC category of the locale, in
representations of floating-point numbers. This locale-specific
character is used in recognizing numeric input, in converting
between strings and numeric values, and in formatting output.
However, regardless of locale, the <period> character (the
decimal-point character of the POSIX locale) is the decimal-point
character recognized in processing awk programs (including
assignments in command line arguments). This is essentially the
same convention as the one used in the ISO C standard. The
difference is that the C language includes the setlocale()
function, which permits an application to modify its locale.
Because of this capability, a C application begins executing with
its locale set to the C locale, and only executes in the
environment-specified locale after an explicit call to
setlocale(). However, adding such an elaborate new feature to
the awk language was seen as inappropriate for POSIX.1‐2008. It
is possible to execute an awk program explicitly in any desired
locale by setting the environment in the shell.
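For instance, a program can be pinned to the POSIX locale from the
shell, so that the <period> is used as the decimal-point character
in output regardless of the caller's environment. This is only a
sketch of the technique; any other installed locale name could be
substituted for C.

```sh
# LC_ALL overrides LC_NUMERIC for this one invocation of awk.
posix_num=$(LC_ALL=C awk 'BEGIN { printf "%.2f\n", 3 / 2 }')
echo "$posix_num"
```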
The undefined behavior resulting from NULs in extended regular
expressions allows future extensions for the GNU gawk program to
process binary data.
The behavior in the case of invalid awk programs (including
lexical, syntactic, and semantic errors) is undefined because it
was considered overly limiting on implementations to specify. In
most cases such errors can be expected to produce a diagnostic
and a non-zero exit status. However, some implementations may
choose to extend the language in ways that make use of certain
invalid constructs. Other invalid constructs might be deemed
worthy of a warning, but otherwise cause some reasonable
behavior. Still other constructs may be very difficult to detect
in some implementations. Also, different implementations might
detect a given error during an initial parsing of the program
(before reading any input files) while others might detect it
when executing the program after reading some input. Implementors
should be aware that diagnosing errors as early as possible and
producing useful diagnostics can ease debugging of applications,
and thus make an implementation more usable.
The unspecified behavior from using multi-character RS values is
to allow possible future extensions based on extended regular
expressions used for record separators. Historical
implementations take the first character of the string and ignore
the others.
Unspecified behavior when split(string,array,<null>) is used is
to allow a proposed future extension that would split up a string
into an array of individual characters.
In the context of the getline function, equally good arguments
for different precedences of the | and < operators can be made.
Historical practice has been that:
getline < "a" "b"
is parsed as:
( getline < "a" ) "b"
although many would argue that the intent was that the file ab
should be read. However:
getline < "x" + 1
parses as:
getline < ( "x" + 1 )
Similar problems occur with the | version of getline,
particularly in combination with $. For example:
$"echo hi" | getline
(This situation is particularly problematic when used in a print
statement, where the |getline part might be a redirection of the
print.)
Since in most cases such constructs are not (or at least should
not be) used (because they have a natural ambiguity for which
there is no conventional parsing), the meaning of these
constructs has been made explicitly unspecified. (The effect is
that a conforming application that runs into the problem must
parenthesize to resolve the ambiguity.) There appeared to be few
if any actual uses of such constructs.
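A minimal sketch of the parenthesized forms follows. The scratch
directory, the file name ab, and the variable names are hypothetical,
and mktemp -d is assumed to be available on the system.

```sh
work_dir=$(mktemp -d)
printf 'first line\n' > "$work_dir/ab"

got=$(awk -v dir="$work_dir" 'BEGIN {
    # Parentheses make the concatenated file name the operand of <.
    if ((getline line < (dir "/" "a" "b")) > 0)
        print line
    # Parentheses likewise delimit the command | getline form.
    if (("echo hi" | getline reply) > 0)
        print reply
}')
echo "$got"
rm -r "$work_dir"
```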
Grammars can be written that would cause an error under these
circumstances. Where backwards-compatibility is not a large
consideration, implementors may wish to use such grammars.
Some historical implementations have allowed some built-in
functions to be called without an argument list, the result being
a default argument list chosen in some ``reasonable'' way. Use of
length as a synonym for length($0) is the only one of these forms
that is thought to be widely known or widely used; this
particular form is documented in various places (for example,
most historical awk reference pages, although not in the
referenced The AWK Programming Language) as legitimate practice.
With this exception, default argument lists have always been
undocumented and vaguely defined, and it is not at all clear how
(or if) they should be generalized to user-defined functions.
They add no useful functionality and preclude possible future
extensions that might need to name functions without calling
them. Not standardizing them seems the simplest course. The
standard developers considered that length merited special
treatment, however, since it has been documented in the past and
sees possibly substantial use in historical programs.
Accordingly, this usage has been made legitimate, but Issue 5
removed the obsolescent marking for XSI-conforming
implementations and many otherwise conforming applications depend
on this feature.
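The equivalence of the two spellings can be checked with a one-line
sketch (the input word is arbitrary):

```sh
# length with no argument list is a synonym for length($0).
len=$(echo "hello" | awk '{ print length, length($0) }')
echo "$len"
```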
In sub and gsub, if repl is a string literal (the lexical token
STRING), then two consecutive <backslash> characters should be
used in the string to ensure a single <backslash> will precede
the <ampersand> when the resultant string is passed to the
function. (For example, to specify one literal <ampersand> in the
replacement string, use gsub(ERE, "\\&").)
Historically, the only special character in the repl argument of
the sub and gsub string functions was the <ampersand> ('&')
character, and preceding it with the <backslash> character was
used to turn off its special meaning.
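The two cases can be compared side by side in a short sketch. The
input text and variable names are arbitrary; the awk program is
single-quoted so that the shell passes the <backslash> characters
through unchanged.

```sh
amp=$(echo "a.b" | awk '{
    s = $0; t = $0
    gsub(/\./, "[&]", s)    # unescaped & stands for the matched text
    gsub(/\./, "\\&", t)    # "\\&" reaches gsub as \& and yields a literal &
    print s, t
}')
echo "$amp"
```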
The description in the ISO POSIX‐2:1993 standard introduced
behavior such that the <backslash> character was another special
character and it was unspecified whether there were any other
special characters. This description introduced several
portability problems, some of which are described below, and so
it has been replaced with the more historical description. Some
of the problems include:
* Historically, to create the replacement string, a script
could use gsub(ERE, "\\&"), but with the ISO POSIX‐2:1993
standard wording, it was necessary to use gsub(ERE, "\\\\&").
The <backslash> characters are doubled here because all
string literals are subject to lexical analysis, which would
reduce each pair of <backslash> characters to a single
<backslash> before being passed to gsub.
* Since it was unspecified what the special characters were,
for portable scripts to guarantee that characters are printed
literally, each character had to be preceded with a
<backslash>. (For example, a portable script had to use
gsub(ERE, "\\h\\i") to produce a replacement string of "hi".)
The description for comparisons in the ISO POSIX‐2:1993 standard
did not properly describe historical practice because of the way
numeric strings are compared as numbers. Those rules caused
the following code:
if (0 == "000")
    print "strange, but true"
else
    print "not true"
to do a numeric comparison, causing the if to succeed. It should
be intuitively obvious that this is incorrect behavior, and
indeed, no historical implementation of awk actually behaves this
way.
To fix this problem, the definition of numeric string was
enhanced to include only those values obtained from specific
circumstances (mostly external sources) where it is not possible
to determine unambiguously whether the value is intended to be a
string or a numeric.
Variables that are assigned to a numeric string shall also be
treated as a numeric string. (That is, the notion of a
numeric string can be propagated across assignments.) In
comparisons, all variables having the uninitialized value are to
be treated as a numeric operand evaluating to the numeric value
zero.
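The resulting rules can be illustrated with a sketch like the
following, where the variable names are hypothetical. A string
constant is compared as a string, a field read from input is a
numeric string, the numeric-string property follows an assignment,
and an uninitialized variable compares as numeric zero.

```sh
cmp=$(echo "000" | awk '{
    c = ("000" == 0) ? "equal" : "differ"     # string constant: string comparison
    f = ($1 == 0) ? "equal" : "differ"        # field from input: numeric string
    x = $1                                    # the property follows the assignment
    v = (x == 0) ? "equal" : "differ"
    u = (never_set == 0) ? "equal" : "differ" # uninitialized: numeric zero
    print c, f, v, u
}')
echo "$cmp"
```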
Uninitialized variables include all types of variables including
scalars, array elements, and fields. The definition of an
uninitialized value in Variables and Special Variables is
necessary to describe the value placed on uninitialized variables
and on fields that are valid (for example, < $NF) but have no
characters in them and to describe how these variables are to be
used in comparisons. A valid field, such as $1, that has no
characters in it can be obtained from an input line of "\t\t"
when FS='\t'. Historically, the comparison ($1<10) was done
numerically after evaluating $1 to the value zero.
The phrase ``... also shall have the numeric value of the numeric
string'' was removed from several sections of the
ISO POSIX‐2:1993 standard because it specifies an unnecessary
implementation detail. It is not necessary for POSIX.1‐2008 to
specify that these objects be assigned two different values. It
is only necessary to specify that these objects may evaluate to
two different values depending on context.
Historical implementations of awk did not parse hexadecimal
integer or floating constants like "0xa" and "0xap0". Due to an
oversight, the 2001 through 2004 editions of this standard
required support for hexadecimal floating constants. This was due
to the reference to atof(). This version of the standard allows
but does not require implementations to use atof() and includes a
description of how floating-point numbers are recognized as an
alternative to match historic behavior. The intent of this change
is to allow implementations to recognize floating-point constants
according to either the ISO/IEC 9899:1990 standard or
ISO/IEC 9899:1999 standard, and to allow (but not require)
implementations to recognize hexadecimal integer constants.
Historical implementations of awk did not support floating-point
infinities and NaNs in numeric strings; e.g., "-INF" and "NaN".
However, implementations that use the atof() or strtod()
functions to do the conversion picked up support for these values
if they used an ISO/IEC 9899:1999 standard version of the function
instead of an ISO/IEC 9899:1990 standard version. Due to an
oversight, the 2001 through 2004 editions of this standard did
not allow support for infinities and NaNs, but in this revision
support is allowed (but not required). This is a silent change to
the behavior of awk programs; for example, in the POSIX locale
the expression:
("-INF" + 0 < 0)
formerly had the value 0 because "-INF" converted to 0, but now
it may have the value 0 or 1.