pattern scanning and processing language
VARIABLES, RECORDS AND FIELDS
AWK variables are dynamic; they come into existence when they are
first used. Their values are either floating-point numbers or
strings, or both, depending upon how they are used.
Additionally, gawk allows variables to have regular-expression
type. AWK also has one dimensional arrays; arrays with multiple
dimensions may be simulated. Gawk provides true arrays of
arrays; see Arrays, below. Several pre-defined variables are set
as a program runs; these are described as needed and summarized
below.
Records
Normally, records are separated by newline characters. You can
control how records are separated by assigning values to the
built-in variable RS. If RS is any single character, that
character separates records. Otherwise, RS is a regular
expression. Text in the input that matches this regular
expression separates the record. However, in compatibility mode,
only the first character of its string value is used for
separating records. If RS is set to the null string, then
records are separated by empty lines. When RS is set to the null
string, the newline character always acts as a field separator,
in addition to whatever value FS may have.
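As an illustrative sketch (the sample input is invented for this
example), the following shell fragment sets RS to the null string so
that blocks separated by empty lines become records. Because newline
then also acts as a field separator, each two-line block should be
seen as four fields:

```shell
# Paragraph mode: RS = "" makes empty lines separate records, and
# newline separates fields in addition to the default FS.
out=$(printf 'name: a\nsize: 1\n\nname: b\nsize: 2\n' |
  awk 'BEGIN { RS = "" } { print NR, NF, $2 }')
echo "$out"
```

Each output line shows the record number, the field count, and the
second field of that record.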
Fields
As each input record is read, gawk splits the record into fields,
using the value of the FS variable as the field separator. If FS
is a single character, fields are separated by that character.
If FS is the null string, then each individual character becomes
a separate field. Otherwise, FS is expected to be a full regular
expression. In the special case that FS is a single space,
fields are separated by runs of spaces and/or tabs and/or
newlines. NOTE: The value of IGNORECASE (see below) also affects
how fields are split when FS is a regular expression, and how
records are separated when RS is a regular expression.
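A minimal sketch contrasting the default single-space FS with a
single-character FS. The input strings are made up for illustration,
and the remark about leading and trailing blanks reflects standard
awk behavior rather than anything stated above:

```shell
# Default FS (a single space): runs of blanks separate fields, and
# leading and trailing blanks do not produce empty fields.
out1=$(printf '  a   b \n' | awk '{ print NF ":" $1 }')
# Single-character FS: every comma separates, so the empty field
# between the two commas is kept.
out2=$(printf 'a,,b\n' | awk 'BEGIN { FS = "," } { print NF ":" $2 ":" $3 }')
echo "$out1"
echo "$out2"
```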
If the FIELDWIDTHS variable is set to a space-separated list of
numbers, each field is expected to have fixed width, and gawk
splits up the record using the specified widths. Each field
width may optionally be preceded by a colon-separated value
specifying the number of characters to skip before the field
starts. The value of FS is ignored. Assigning a new value to FS
or FPAT overrides the use of FIELDWIDTHS.
Similarly, if the FPAT variable is set to a string representing a
regular expression, each field is made up of text that matches
that regular expression. In this case, the regular expression
describes the fields themselves, instead of the text that
separates the fields. Assigning a new value to FS or FIELDWIDTHS
overrides the use of FPAT.
Each field in the input record may be referenced by its position:
$1, $2, and so on. $0 is the whole record, including leading and
trailing whitespace. Fields need not be referenced by constants:
n = 5
print $n
prints the fifth field in the input record.
The variable NF is set to the total number of fields in the input
record.
References to non-existent fields (i.e., fields after $NF)
produce the null string. However, assigning to a non-existent
field (e.g., $(NF+2) = 5) increases the value of NF, creates any
intervening fields with the null string as their values, and
causes the value of $0 to be recomputed, with the fields being
separated by the value of OFS. References to negative numbered
fields cause a fatal error. Decrementing NF causes the values of
fields past the new value to be lost, and the value of $0 to be
recomputed, with the fields being separated by the value of OFS.
Assigning a value to an existing field causes the whole record to
be rebuilt when $0 is referenced. Similarly, assigning a value
to $0 causes the record to be resplit, creating new values for
the fields.
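The effect of field assignment on $0 and NF can be sketched as
follows. The input line and the choice of "-" for OFS are
illustrative only:

```shell
out=$(echo 'a b c' | awk 'BEGIN { OFS = "-" }
{
    $2 = "X"         # assign to an existing field; $0 is rebuilt with OFS
    print
    $(NF + 2) = "Z"  # assign past NF; the intervening field is created empty
    print
    print NF         # NF has grown from 3 to 5
}')
echo "$out"
```

The second line printed contains two adjacent separators, which mark
the empty fourth field created by the assignment.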
Built-in Variables
Gawk's built-in variables are:
ARGC
The number of command line arguments (does not include
options to gawk, or the program source).
ARGIND
The index in ARGV of the current file being processed.
ARGV
Array of command line arguments. The array is indexed
from 0 to ARGC - 1. Dynamically changing the contents of
ARGV can control the files used for data.
BINMODE
On non-POSIX systems, specifies use of 'binary' mode for
all file I/O. Numeric values of 1, 2, or 3 specify that
input files, output files, or all files, respectively,
should use binary I/O. String values of "r" or "w"
specify that input files or output files, respectively,
should use binary I/O. String values of "rw" or "wr"
specify that all files should use binary I/O. Any other
string value is treated as "rw", but generates a warning
message.
CONVFMT
The conversion format for numbers, "%.6g", by default.
ENVIRON
An array containing the values of the current environment.
The array is indexed by the environment variables, each
element being the value of that variable (e.g.,
ENVIRON["HOME"] might be "/home/arnold").
In POSIX mode, changing this array does not affect the
environment seen by programs which gawk spawns via
redirection or the system() function. Otherwise, gawk
updates its real environment so that programs it spawns
see the changes.
ERRNO
If a system error occurs either doing a redirection for
getline, during a read for getline, or during a close(),
then ERRNO is set to a string describing the error. The
value is subject to translation in non-English locales.
If the string in ERRNO corresponds to a system error in
the errno(3) variable, then the numeric value can be found
in PROCINFO["errno"]. For non-system errors,
PROCINFO["errno"] will be zero.
FIELDWIDTHS
A whitespace-separated list of field widths. When set,
gawk parses the input into fields of fixed width, instead
of using the value of the FS variable as the field
separator. Each field width may optionally be preceded by
a colon-separated value specifying the number of
characters to skip before the field starts. See Fields,
above.
FILENAME
The name of the current input file. If no files are
specified on the command line, the value of FILENAME is
'-'. However, FILENAME is undefined inside the BEGIN rule
(unless set by getline).
FNR
The input record number in the current input file.
FPAT
A regular expression describing the contents of the fields
in a record. When set, gawk parses the input into fields,
where the fields match the regular expression, instead of
using the value of FS as the field separator. See Fields,
above.
FS
The input field separator, a space by default. See
Fields, above.
FUNCTAB
An array whose indices and corresponding values are the
names of all the user-defined or extension functions in
the program. NOTE: You may not use the delete statement
with the FUNCTAB array.
IGNORECASE
Controls the case-sensitivity of all regular expression
and string operations. If IGNORECASE has a non-zero
value, then string comparisons and pattern matching in
rules, field splitting with FS and FPAT, record separating
with RS, regular expression matching with ~ and !~, and
the gensub(), gsub(), index(), match(), patsplit(),
split(), and sub() built-in functions all ignore case when
doing regular expression operations. NOTE: Array
subscripting is not affected. However, the asort() and
asorti() functions are affected.
Thus, if IGNORECASE is not equal to zero, /aB/ matches all
of the strings "ab", "aB", "Ab", and "AB". As with all
AWK variables, the initial value of IGNORECASE is zero, so
all regular expression and string operations are normally
case-sensitive.
LINT
Provides dynamic control of the --lint option from within
an AWK program. When true, gawk prints lint warnings.
When false, it does not. The values allowed for the
--lint option may also be assigned to LINT, with the same
effects. Any other true value just prints warnings.
NF
The number of fields in the current input record.
NR
The total number of input records seen so far.
OFMT
The output format for numbers, "%.6g", by default.
OFS
The output field separator, a space by default.
ORS
The output record separator, by default a newline.
PREC
The working precision of arbitrary precision floating-
point numbers, 53 by default.
PROCINFO
The elements of this array provide access to information
about the running AWK program. On some systems, there may
be elements in the array, "group1" through "groupn" for
some n, which is the number of supplementary groups that
the process has. Use the in operator to test for these
elements. The following elements are guaranteed to be
available:
PROCINFO["argv"]
The command line arguments as received by gawk at
the C-language level. The subscripts start from
zero.
PROCINFO["egid"]
The value of the getegid(2) system call.
PROCINFO["errno"]
The value of errno(3) when ERRNO is set to the
associated error message.
PROCINFO["euid"]
The value of the geteuid(2) system call.
PROCINFO["FS"]
"FS" if field splitting with FS is in effect,
"FPAT" if field splitting with FPAT is in effect,
"FIELDWIDTHS" if field splitting with FIELDWIDTHS
is in effect, or "API" if API input parser field
splitting is in effect.
PROCINFO["gid"]
The value of the getgid(2) system call.
PROCINFO["identifiers"]
A subarray, indexed by the names of all identifiers
used in the text of the AWK program. The values
indicate what gawk knows about the identifiers
after it has finished parsing the program; they are
not updated while the program runs. For each
identifier, the value of the element is one of the
following:
"array"
The identifier is an array.
"builtin"
The identifier is a built-in function.
"extension"
The identifier is an extension function
loaded via @load or --load.
"scalar"
The identifier is a scalar.
"untyped"
The identifier is untyped (could be used as
a scalar or array, gawk doesn't know yet).
"user"
The identifier is a user-defined function.
PROCINFO["pgrpid"]
The value of the getpgrp(2) system call.
PROCINFO["pid"]
The value of the getpid(2) system call.
PROCINFO["platform"]
A string indicating the platform for which gawk was
compiled. It is one of:
"djgpp", "mingw"
Microsoft Windows, using either DJGPP or
MinGW, respectively.
"os2"
OS/2.
"posix"
GNU/Linux, Cygwin, Mac OS X, and legacy Unix
systems.
"vms"
OpenVMS or Vax/VMS.
PROCINFO["ppid"]
The value of the getppid(2) system call.
PROCINFO["strftime"]
The default time format string for strftime().
Changing its value affects how strftime() formats
time values when called with no arguments.
PROCINFO["uid"]
The value of the getuid(2) system call.
PROCINFO["version"]
The version of gawk.
The following elements are present if loading dynamic
extensions is available:
PROCINFO["api_major"]
The major version of the extension API.
PROCINFO["api_minor"]
The minor version of the extension API.
The following elements are available if MPFR support is
compiled into gawk:
PROCINFO["gmp_version"]
The version of the GNU GMP library used for
arbitrary precision number support in gawk.
PROCINFO["mpfr_version"]
The version of the GNU MPFR library used for
arbitrary precision number support in gawk.
PROCINFO["prec_max"]
The maximum precision supported by the GNU MPFR
library for arbitrary precision floating-point
numbers.
PROCINFO["prec_min"]
The minimum precision allowed by the GNU MPFR
library for arbitrary precision floating-point
numbers.
The following elements may be set by a program to change
gawk's behavior:
PROCINFO["NONFATAL"]
If this exists, then I/O errors for all
redirections become nonfatal.
PROCINFO["name", "NONFATAL"]
Make I/O errors for name be nonfatal.
PROCINFO["command", "pty"]
Use a pseudo-tty for two-way communication with
command instead of setting up two one-way pipes.
PROCINFO["input", "READ_TIMEOUT"]
The timeout in milliseconds for reading data from
input, where input is a redirection string or a
filename. A value of zero or less than zero means
no timeout.
PROCINFO["input", "RETRY"]
If an I/O error that may be retried occurs when
reading data from input, and this array entry
exists, then getline returns -2 instead of
following the default behavior of returning -1 and
configuring input to return no further data. An
I/O error that may be retried is one where errno(3)
has the value EAGAIN, EWOULDBLOCK, EINTR, or
ETIMEDOUT. This may be useful in conjunction with
PROCINFO["input", "READ_TIMEOUT"] or in situations
where a file descriptor has been configured to
behave in a non-blocking fashion.
PROCINFO["sorted_in"]
If this element exists in PROCINFO, then its value
controls the order in which array elements are
traversed in for loops. Supported values are
"@ind_str_asc", "@ind_num_asc", "@val_type_asc",
"@val_str_asc", "@val_num_asc", "@ind_str_desc",
"@ind_num_desc", "@val_type_desc", "@val_str_desc",
"@val_num_desc", and "@unsorted". The value can
also be the name (as a string) of any comparison
function defined as follows:
function cmp_func(i1, v1, i2, v2)
where i1 and i2 are the indices, and v1 and v2 are
the corresponding values of the two elements being
compared. It should return a number less than,
equal to, or greater than 0, depending on how the
elements of the array are to be ordered.
ROUNDMODE
The rounding mode to use for arbitrary precision
arithmetic on numbers, by default "N" (IEEE-754
roundTiesToEven mode). The accepted values are:
"A" or "a"
for rounding away from zero. These are only
available if your version of the GNU MPFR library
supports rounding away from zero.
"D" or "d"
for roundTowardNegative.
"N" or "n"
for roundTiesToEven.
"U" or "u"
for roundTowardPositive.
"Z" or "z"
for roundTowardZero.
RS
The input record separator, by default a newline.
RT
The record terminator. Gawk sets RT to the input text
that matched the character or regular expression specified
by RS.
RSTART
The index of the first character matched by match(); 0 if
no match. (This implies that character indices start at
one.)
RLENGTH
The length of the string matched by match(); -1 if no
match.
SUBSEP
The string used to separate multiple subscripts in array
elements, by default "\034".
SYMTAB
An array whose indices are the names of all currently
defined global variables and arrays in the program. The
array may be used for indirect access to read or write the
value of a variable:
foo = 5
SYMTAB["foo"] = 4
print foo # prints 4
The typeof() function may be used to test if an element in
SYMTAB is an array. You may not use the delete statement
with the SYMTAB array, nor assign to elements with an
index that is not a variable name.
TEXTDOMAIN
The text domain of the AWK program; used to find the
localized translations for the program's strings.
Arrays
Arrays are subscripted with an expression between square brackets
([ and ]). If the expression is an expression list (expr, expr
...) then the array subscript is a string consisting of the
concatenation of the (string) value of each expression, separated
by the value of the SUBSEP variable. This facility is used to
simulate multiply dimensioned arrays. For example:
i = "A"; j = "B"; k = "C"
x[i, j, k] = "hello, world\n"
assigns the string "hello, world\n" to the element of the array x
which is indexed by the string "A\034B\034C". All arrays in AWK
are associative, i.e., indexed by string values.
The special operator in may be used to test if an array has an
index consisting of a particular value:
if (val in array)
print array[val]
If the array has multiple subscripts, use (i, j) in array.
The in construct may also be used in a for loop to iterate over
all the elements of an array. However, the (i, j) in array
construct only works in tests, not in for loops.
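A short sketch of simulated multiple subscripts. The array name x and
the helper array part are chosen only for this example. The test form
(i, j) in array checks membership, while a for loop sees each index
as a single string joined by SUBSEP, which split() can take apart
again:

```shell
out=$(awk 'BEGIN {
    x["A", "B"] = "hello"
    if (("A", "B") in x)          # membership test with two subscripts
        print "found"
    for (key in x) {              # key is the SUBSEP-joined index string
        n = split(key, part, SUBSEP)
        print n, part[1], part[2]
    }
}')
echo "$out"
```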
An element may be deleted from an array using the delete
statement. The delete statement may also be used to delete the
entire contents of an array, just by specifying the array name
without a subscript.
Gawk supports true multidimensional arrays. It does not require
that such arrays be "rectangular" as in C or C++. For example:
a[1] = 5
a[2][1] = 6
a[2][2] = 7
NOTE: You may need to tell gawk that an array element is really a
subarray in order to use it where gawk expects an array (such as
in the second argument to split()). You can do this by creating
an element in the subarray and then deleting it with the delete
statement.
Namespaces
Gawk provides a simple namespace facility to help work around the
fact that all variables in AWK are global.
A qualified name consists of two simple identifiers joined by a
double colon (::). The left-hand identifier represents the
namespace and the right-hand identifier is the variable within
it. All simple (non-qualified) names are considered to be in the
"current" namespace; the default namespace is awk. However,
simple identifiers consisting solely of uppercase letters are
forced into the awk namespace, even if the current namespace is
different.
You change the current namespace with an @namespace "name"
directive.
The standard predefined builtin function names may not be used as
namespace names. The names of additional functions provided by
gawk may be used as namespace names or as simple identifiers in
other namespaces. For more details, see GAWK: Effective AWK
Programming.
Variable Typing And Conversion
Variables and fields may be (floating point) numbers, or strings,
or both. They may also be regular expressions. How the value of
a variable is interpreted depends upon its context. If used in a
numeric expression, it will be treated as a number; if used as a
string it will be treated as a string.
To force a variable to be treated as a number, add zero to it; to
force it to be treated as a string, concatenate it with the null
string.
Uninitialized variables have the numeric value zero and the
string value "" (the null, or empty, string).
When a string must be converted to a number, the conversion is
accomplished using strtod(3). A number is converted to a string
by using the value of CONVFMT as a format string for sprintf(3),
with the numeric value of the variable as the argument. However,
even though all numbers in AWK are floating-point, integral
values are always converted as integers. Thus, given
CONVFMT = "%2.2f"
a = 12
b = a ""
the variable b has a string value of "12" and not "12.00".
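The conversion rule can be checked with a small sketch. The variables
c and d and the value 3.14159 are additions for illustration, showing
that CONVFMT is applied only to non-integral values:

```shell
out=$(awk 'BEGIN {
    CONVFMT = "%2.2f"
    a = 12;      b = a ""   # integral value: converted as an integer
    c = 3.14159; d = c ""   # non-integral value: converted using CONVFMT
    print b
    print d
}')
echo "$out"
```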
NOTE: When operating in POSIX mode (such as with the --posix
option), beware that locale settings may interfere with the way
decimal numbers are treated: the decimal separator of the numbers
you are feeding to gawk must conform to what your locale would
expect, be it a comma (,) or a period (.).
Gawk performs comparisons as follows: If two variables are
numeric, they are compared numerically. If one value is numeric
and the other has a string value that is a 'numeric string,' then
comparisons are also done numerically. Otherwise, the numeric
value is converted to a string and a string comparison is
performed. Two strings are compared, of course, as strings.
Note that string constants, such as "57", are not numeric
strings; they are string constants. The idea of 'numeric string'
only applies to fields, getline input, FILENAME, ARGV elements,
ENVIRON elements and the elements of an array created by split()
or patsplit() that are numeric strings. The basic idea is that
user input, and only user input, that looks numeric, should be
treated that way.
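The difference between numeric strings and string constants can be
sketched as follows. The values 10 and 9 and the variable r are
chosen for illustration; as fields they compare numerically, while as
string constants they compare character by character:

```shell
# Fields that look numeric are compared as numbers: 10 is greater than 9.
out1=$(echo '10 9' |
  awk '{ r = ($1 > $2) ? "greater" : "not greater"; print r }')
# String constants are compared as strings: "10" sorts before "9".
out2=$(awk 'BEGIN { r = ("10" > "9") ? "greater" : "not greater"; print r }')
echo "$out1"
echo "$out2"
```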
Octal and Hexadecimal Constants
You may use C-style octal and hexadecimal constants in your AWK
program source code. For example, the octal value 011 is equal
to decimal 9, and the hexadecimal value 0x11 is equal to decimal
17.
String Constants
String constants in AWK are sequences of characters enclosed
between double quotes (like "value"). Within strings, certain
escape sequences are recognized, as in C. These are:
\\
A literal backslash.
\a
The 'alert' character; usually the ASCII BEL character.
\b
Backspace.
\f
Form-feed.
\n
Newline.
\r
Carriage return.
\t
Horizontal tab.
\v
Vertical tab.
\xhex digits
The character represented by the string of hexadecimal
digits following the \x. Up to two following hexadecimal
digits are considered part of the escape sequence. E.g.,
"\x1B" is the ASCII ESC (escape) character.
\ddd
The character represented by the 1-, 2-, or 3-digit
sequence of octal digits. E.g., "\033" is the ASCII ESC
(escape) character.
\c
The literal character c.
In compatibility mode, the characters represented by octal and
hexadecimal escape sequences are treated literally when used in
regular expression constants. Thus, /a\52b/ is equivalent to
/a\*b/.
Regexp Constants
A regular expression constant is a sequence of characters
enclosed between forward slashes (like /value/). Regular
expression matching is described more fully below; see Regular
Expressions.
The escape sequences described earlier may also be used inside
constant regular expressions (e.g., /[ \t\f\n\r\v]/ matches
whitespace characters).
Gawk provides strongly typed regular expression constants. These
are written with a leading @ symbol (like so: @/value/). Such
constants may be assigned to scalars (variables, array elements)
and passed to user-defined functions. Variables that have been so
assigned have regular expression type.