gawk(1)

gawk - pattern scanning and processing language

VARIABLES, RECORDS AND FIELDS

AWK variables are dynamic; they come into existence when they are first used. Their values are either floating-point numbers or strings, or both, depending upon how they are used. Additionally, gawk allows variables to have regular-expression type. AWK also has one-dimensional arrays; arrays with multiple dimensions may be simulated. Gawk provides true arrays of arrays; see Arrays, below. Several pre-defined variables are set as a program runs; these are described as needed and summarized below.

Records

Normally, records are separated by newline characters. You can control how records are separated by assigning values to the built-in variable RS. If RS is any single character, that character separates records. Otherwise, RS is a regular expression. Text in the input that matches this regular expression separates the record. However, in compatibility mode, only the first character of its string value is used for separating records. If RS is set to the null string, then records are separated by empty lines. When RS is set to the null string, the newline character always acts as a field separator, in addition to whatever value FS may have.
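The null-string case for RS (often called paragraph mode) can be sketched as follows. This is a minimal illustration, assuming a POSIX-compatible awk on PATH; the sample input and the shell variable name out are made up for the example.

```shell
# Paragraph mode: with RS set to the null string, blank lines separate
# records and newline also acts as a field separator.
out=$(printf 'alpha beta\ngamma\n\ndelta epsilon\n' |
  awk 'BEGIN { RS = "" } { print NR ": " NF " fields, first=" $1 }')
printf '%s\n' "$out"
# 1: 3 fields, first=alpha
# 2: 2 fields, first=delta
```

The first record spans two input lines, so "gamma" is counted as its third field.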

Fields

As each input record is read, gawk splits the record into fields, using the value of the FS variable as the field separator. If FS is a single character, fields are separated by that character. If FS is the null string, then each individual character becomes a separate field. Otherwise, FS is expected to be a full regular expression. In the special case that FS is a single space, fields are separated by runs of spaces and/or tabs and/or newlines. NOTE: The value of IGNORECASE (see below) also affects how fields are split when FS is a regular expression, and how records are separated when RS is a regular expression.
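A short sketch of three of the FS cases described above: a single character, the default single space, and a regular expression. It assumes a POSIX-compatible awk on PATH; the input strings and shell variable names are illustrative only.

```shell
# FS as a single character: fields are split on each colon.
colon=$(printf 'root:x:0\n' | awk 'BEGIN { FS = ":" } { print $3 }')

# Default FS (a single space): runs of blanks and tabs separate fields,
# and leading or trailing whitespace does not create empty fields.
blank=$(printf '  a \t b  \n' | awk '{ print NF, $1, $2 }')

# FS as a regular expression: each run of digits is a separator.
regex=$(printf 'a1b22c\n' | awk 'BEGIN { FS = "[0-9]+" } { print $1 $2 $3 }')

printf '%s\n' "$colon" "$blank" "$regex"
# 0
# 2 a b
# abc
```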

If the FIELDWIDTHS variable is set to a space-separated list of numbers, each field is expected to have fixed width, and gawk splits up the record using the specified widths. Each field width may optionally be preceded by a colon-separated value specifying the number of characters to skip before the field starts. The value of FS is ignored. Assigning a new value to FS or FPAT overrides the use of FIELDWIDTHS.

Similarly, if the FPAT variable is set to a string representing a regular expression, each field is made up of text that matches that regular expression. In this case, the regular expression describes the fields themselves, instead of the text that separates the fields. Assigning a new value to FS or FIELDWIDTHS overrides the use of FPAT.

Each field in the input record may be referenced by its position: $1, $2, and so on. $0 is the whole record, including leading and trailing whitespace. Fields need not be referenced by constants:

    n = 5
    print $n

prints the fifth field in the input record.

The variable NF is set to the total number of fields in the input record.

References to non-existent fields (i.e., fields after $NF) produce the null string. However, assigning to a non-existent field (e.g., $(NF+2) = 5) increases the value of NF, creates any intervening fields with the null string as their values, and causes the value of $0 to be recomputed, with the fields being separated by the value of OFS. References to negative numbered fields cause a fatal error. Decrementing NF causes the values of fields past the new value to be lost, and the value of $0 to be recomputed, with the fields being separated by the value of OFS.
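The effect of assigning past $NF and of decrementing NF can be seen in a small sketch. It assumes a POSIX-compatible awk on PATH; OFS is set to "-" only to make the rebuilt record easy to read, and the shell variable names are illustrative.

```shell
# Assigning to $(NF+2) raises NF to 5 and creates an empty $4;
# $0 is rebuilt with OFS between the fields.
grown=$(printf 'a b c\n' |
  awk 'BEGIN { OFS = "-" } { $(NF + 2) = "x"; print NF ": " $0 }')

# Decrementing NF discards the trailing field and rebuilds $0.
shrunk=$(printf 'a b c\n' |
  awk 'BEGIN { OFS = "-" } { NF = 2; print NF ": " $0 }')

printf '%s\n' "$grown" "$shrunk"
# 5: a-b-c--x
# 2: a-b
```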

Assigning a value to an existing field causes the whole record to be rebuilt when $0 is referenced. Similarly, assigning a value to $0 causes the record to be resplit, creating new values for the fields.

Built-in Variables

Gawk's built-in variables are:

ARGC The number of command line arguments (does not include options to gawk, or the program source).

ARGIND The index in ARGV of the current file being processed.

ARGV Array of command line arguments. The array is indexed from 0 to ARGC - 1. Dynamically changing the contents of ARGV can control the files used for data.

BINMODE On non-POSIX systems, specifies use of 'binary' mode for all file I/O. Numeric values of 1, 2, or 3, specify that input files, output files, or all files, respectively, should use binary I/O. String values of "r", or "w" specify that input files, or output files, respectively, should use binary I/O. String values of "rw" or "wr" specify that all files should use binary I/O. Any other string value is treated as "rw", but generates a warning message.

CONVFMT The conversion format for numbers, "%.6g", by default.

ENVIRON An array containing the values of the current environment. The array is indexed by the environment variables, each element being the value of that variable (e.g., ENVIRON["HOME"] might be "/home/arnold").

In POSIX mode, changing this array does not affect the environment seen by programs which gawk spawns via redirection or the system() function. Otherwise, gawk updates its real environment so that programs it spawns see the changes.

ERRNO If a system error occurs either doing a redirection for getline, during a read for getline, or during a close(), then ERRNO is set to a string describing the error. The value is subject to translation in non-English locales. If the string in ERRNO corresponds to a system error in the errno(3) variable, then the numeric value can be found in PROCINFO["errno"]. For non-system errors, PROCINFO["errno"] will be zero.

FIELDWIDTHS A whitespace-separated list of field widths. When set, gawk parses the input into fields of fixed width, instead of using the value of the FS variable as the field separator. Each field width may optionally be preceded by a colon-separated value specifying the number of characters to skip before the field starts. See Fields, above.

FILENAME The name of the current input file. If no files are specified on the command line, the value of FILENAME is '-'. However, FILENAME is undefined inside the BEGIN rule (unless set by getline).

FNR The input record number in the current input file.

FPAT A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of FS as the field separator. See Fields, above.

FS The input field separator, a space by default. See Fields, above.

FUNCTAB An array whose indices and corresponding values are the names of all the user-defined or extension functions in the program. NOTE: You may not use the delete statement with the FUNCTAB array.

IGNORECASE Controls the case-sensitivity of all regular expression and string operations. If IGNORECASE has a non-zero value, then string comparisons and pattern matching in rules, field splitting with FS and FPAT, record separating with RS, regular expression matching with ~ and !~, and the gensub(), gsub(), index(), match(), patsplit(), split(), and sub() built-in functions all ignore case when doing regular expression operations. NOTE: Array subscripting is not affected. However, the asort() and asorti() functions are affected. Thus, if IGNORECASE is not equal to zero, /aB/ matches all of the strings "ab", "aB", "Ab", and "AB". As with all AWK variables, the initial value of IGNORECASE is zero, so all regular expression and string operations are normally case-sensitive.

LINT Provides dynamic control of the --lint option from within an AWK program. When true, gawk prints lint warnings. When false, it does not. The values allowed for the --lint option may also be assigned to LINT, with the same effects. Any other true value just prints warnings.

NF The number of fields in the current input record.

NR The total number of input records seen so far.

OFMT The output format for numbers, "%.6g", by default.

OFS The output field separator, a space by default.

ORS The output record separator, by default a newline.

PREC The working precision of arbitrary precision floating-point numbers, 53 by default.

PROCINFO The elements of this array provide access to information about the running AWK program. On some systems, there may be elements in the array, "group1" through "groupn" for some n, which is the number of supplementary groups that the process has. Use the in operator to test for these elements. The following elements are guaranteed to be available:

PROCINFO["argv"] The command line arguments as received by gawk at the C-language level. The subscripts start from zero.

PROCINFO["egid"] The value of the getegid(2) system call.

PROCINFO["errno"] The value of errno(3) when ERRNO is set to the associated error message.

PROCINFO["euid"] The value of the geteuid(2) system call.

PROCINFO["FS"] "FS" if field splitting with FS is in effect, "FPAT" if field splitting with FPAT is in effect, "FIELDWIDTHS" if field splitting with FIELDWIDTHS is in effect, or "API" if API input parser field splitting is in effect.

PROCINFO["gid"] The value of the getgid(2) system call.

PROCINFO["identifiers"] A subarray, indexed by the names of all identifiers used in the text of the AWK program. The values indicate what gawk knows about the identifiers after it has finished parsing the program; they are not updated while the program runs. For each identifier, the value of the element is one of the following:

"array" The identifier is an array.

"builtin" The identifier is a built-in function.

"extension" The identifier is an extension function loaded via @load or --load.

"scalar" The identifier is a scalar.

"untyped" The identifier is untyped (could be used as a scalar or array, gawk doesn't know yet).

"user" The identifier is a user-defined function.

PROCINFO["pgrpid"] The value of the getpgrp(2) system call.

PROCINFO["pid"] The value of the getpid(2) system call.

PROCINFO["platform"] A string indicating the platform for which gawk was compiled. It is one of:

"djgpp", "mingw" Microsoft Windows, using either DJGPP, or MinGW, respectively.

"os2" OS/2.

"posix" GNU/Linux, Cygwin, Mac OS X, and legacy Unix systems.

"vms" OpenVMS or Vax/VMS.

PROCINFO["ppid"] The value of the getppid(2) system call.

PROCINFO["strftime"] The default time format string for strftime(). Changing its value affects how strftime() formats time values when called with no arguments.

PROCINFO["uid"] The value of the getuid(2) system call.

PROCINFO["version"] The version of gawk.

The following elements are present if loading dynamic extensions is available:

PROCINFO["api_major"] The major version of the extension API.

PROCINFO["api_minor"] The minor version of the extension API.

The following elements are available if MPFR support is compiled into gawk:

PROCINFO["gmp_version"] The version of the GNU GMP library used for arbitrary precision number support in gawk.

PROCINFO["mpfr_version"] The version of the GNU MPFR library used for arbitrary precision number support in gawk.

PROCINFO["prec_max"] The maximum precision supported by the GNU MPFR library for arbitrary precision floating-point numbers.

PROCINFO["prec_min"] The minimum precision allowed by the GNU MPFR library for arbitrary precision floating-point numbers.

The following elements may be set by a program to change gawk's behavior:

PROCINFO["NONFATAL"] If this exists, then I/O errors for all redirections become nonfatal.

PROCINFO["name", "NONFATAL"] Make I/O errors for name be nonfatal.

PROCINFO["command", "pty"] Use a pseudo-tty for two-way communication with command instead of setting up two one-way pipes.

PROCINFO["input", "READ_TIMEOUT"] The timeout in milliseconds for reading data from input, where input is a redirection string or a filename. A value of zero or less than zero means no timeout.

PROCINFO["input", "RETRY"] If an I/O error that may be retried occurs when reading data from input, and this array entry exists, then getline returns -2 instead of following the default behavior of returning -1 and configuring input to return no further data. An I/O error that may be retried is one where errno(3) has the value EAGAIN, EWOULDBLOCK, EINTR, or ETIMEDOUT. This may be useful in conjunction with PROCINFO["input", "READ_TIMEOUT"] or in situations where a file descriptor has been configured to behave in a non-blocking fashion.

PROCINFO["sorted_in"] If this element exists in PROCINFO, then its value controls the order in which array elements are traversed in for loops. Supported values are "@ind_str_asc", "@ind_num_asc", "@val_type_asc", "@val_str_asc", "@val_num_asc", "@ind_str_desc", "@ind_num_desc", "@val_type_desc", "@val_str_desc", "@val_num_desc", and "@unsorted". The value can also be the name (as a string) of any comparison function defined as follows:

function cmp_func(i1, v1, i2, v2)

where i1 and i2 are the indices, and v1 and v2 are the corresponding values of the two elements being compared. It should return a number less than, equal to, or greater than 0, depending on how the elements of the array are to be ordered.

ROUNDMODE The rounding mode to use for arbitrary precision arithmetic on numbers, by default "N" (IEEE-754 roundTiesToEven mode). The accepted values are:

"A" or "a" for rounding away from zero. These are only available if your version of the GNU MPFR library supports rounding away from zero.

"D" or "d" for roundTowardNegative.

"N" or "n" for roundTiesToEven.

"U" or "u" for roundTowardPositive.

"Z" or "z" for roundTowardZero.

RS The input record separator, by default a newline.

RT The record terminator. Gawk sets RT to the input text that matched the character or regular expression specified by RS.

RSTART The index of the first character matched by match(); 0 if no match. (This implies that character indices start at one.)

RLENGTH The length of the string matched by match(); -1 if no match.

SUBSEP The string used to separate multiple subscripts in array elements, by default "\034".

SYMTAB An array whose indices are the names of all currently defined global variables and arrays in the program. The array may be used for indirect access to read or write the value of a variable:

    foo = 5
    SYMTAB["foo"] = 4
    print foo    # prints 4

The typeof() function may be used to test if an element in SYMTAB is an array. You may not use the delete statement with the SYMTAB array, nor assign to elements with an index that is not a variable name.

TEXTDOMAIN The text domain of the AWK program; used to find the localized translations for the program's strings.
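A few of the variables summarized above can be observed directly. The sketch below touches only RSTART, RLENGTH, OFS and ORS, and assumes a POSIX-compatible awk on PATH; the strings and shell variable names are made up for the example.

```shell
# RSTART and RLENGTH are set by match(): 1-based index and length
# on success, 0 and -1 when nothing matches.
matched=$(awk 'BEGIN {
    s = "foobar"
    if (match(s, /o+/))  print RSTART, RLENGTH
    if (!match(s, /z/))  print RSTART, RLENGTH
}')

# OFS is placed between print arguments, ORS after each record.
joined=$(printf 'a b\nc d\n' |
  awk 'BEGIN { OFS = ":"; ORS = ";" } { print $1, $2 }')

printf '%s\n' "$matched" "$joined"
# 2 2
# 0 -1
# a:b;c:d;
```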

Arrays

Arrays are subscripted with an expression between square brackets ([ and ]). If the expression is an expression list (expr, expr ...) then the array subscript is a string consisting of the concatenation of the (string) value of each expression, separated by the value of the SUBSEP variable. This facility is used to simulate multiply dimensioned arrays. For example:

    i = "A"; j = "B"; k = "C"
    x[i, j, k] = "hello, world\n"

assigns the string "hello, world\n" to the element of the array x which is indexed by the string "A\034B\034C". All arrays in AWK are associative, i.e., indexed by string values.

The special operator in may be used to test if an array has an index consisting of a particular value:

if (val in array) print array[val]

If the array has multiple subscripts, use (i, j) in array.

The in construct may also be used in a for loop to iterate over all the elements of an array. However, the (i, j) in array construct only works in tests, not in for loops.
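The subscript handling described so far can be sketched in one short program: a simulated two-dimensional subscript, a membership test with in, a for loop that recovers the parts of the key with split(), and deletion of the whole array. It assumes a POSIX-compatible awk on PATH; the array and variable names are illustrative.

```shell
out=$(awk 'BEGIN {
    x["A", "B"] = "hello"              # key is "A" SUBSEP "B"

    if (("A", "B") in x)               # multi-subscript membership test
        print "found"

    for (key in x) {                   # for loops see the joined key only
        n = split(key, part, SUBSEP)
        print n, part[1], part[2], x[key]
    }

    delete x                           # remove every element
    count = 0
    for (key in x) count++
    print count
}')
printf '%s\n' "$out"
# found
# 2 A B hello
# 0
```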

An element may be deleted from an array using the delete statement. The delete statement may also be used to delete the entire contents of an array, just by specifying the array name without a subscript.

gawk supports true multidimensional arrays. It does not require that such arrays be "rectangular" as in C or C++. For example:

    a[1] = 5
    a[2][1] = 6
    a[2][2] = 7

NOTE: You may need to tell gawk that an array element is really a subarray in order to use it where gawk expects an array (such as in the second argument to split()). You can do this by creating an element in the subarray and then deleting it with the delete statement.

Namespaces

Gawk provides a simple namespace facility to help work around the fact that all variables in AWK are global.

A qualified name consists of two simple identifiers joined by a double colon (::). The left-hand identifier represents the namespace and the right-hand identifier is the variable within it. All simple (non-qualified) names are considered to be in the "current" namespace; the default namespace is awk. However, simple identifiers consisting solely of uppercase letters are forced into the awk namespace, even if the current namespace is different.

You change the current namespace with an @namespace "name" directive.

The standard predefined builtin function names may not be used as namespace names. The names of additional functions provided by gawk may be used as namespace names or as simple identifiers in other namespaces. For more details, see GAWK: Effective AWK Programming.

Variable Typing And Conversion

Variables and fields may be (floating point) numbers, or strings, or both. They may also be regular expressions. How the value of a variable is interpreted depends upon its context. If used in a numeric expression, it will be treated as a number; if used as a string it will be treated as a string.

To force a variable to be treated as a number, add zero to it; to force it to be treated as a string, concatenate it with the null string.

Uninitialized variables have the numeric value zero and the string value "" (the null, or empty, string).

When a string must be converted to a number, the conversion is accomplished using strtod(3). A number is converted to a string by using the value of CONVFMT as a format string for sprintf(3), with the numeric value of the variable as the argument. However, even though all numbers in AWK are floating-point, integral values are always converted as integers. Thus, given

    CONVFMT = "%2.2f"
    a = 12
    b = a ""

the variable b has a string value of "12" and not "12.00".
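The same rule can be checked with a non-integral value next to the integral one. This is a minimal sketch assuming a POSIX-compatible awk on PATH; the variables c and d are additions for illustration and are not part of the example above.

```shell
out=$(awk 'BEGIN {
    CONVFMT = "%2.2f"
    a = 12;      b = a ""     # integral value: converted as an integer
    c = 3.14159; d = c ""     # non-integral value: formatted with CONVFMT
    print b
    print d
}')
printf '%s\n' "$out"
# 12
# 3.14
```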

NOTE: When operating in POSIX mode (such as with the --posix option), beware that locale settings may interfere with the way decimal numbers are treated: the decimal separator of the numbers you are feeding to gawk must conform to what your locale would expect, be it a comma (,) or a period (.).

Gawk performs comparisons as follows: If two variables are numeric, they are compared numerically. If one value is numeric and the other has a string value that is a 'numeric string,' then comparisons are also done numerically. Otherwise, the numeric value is converted to a string and a string comparison is performed. Two strings are compared, of course, as strings.

Note that string constants, such as "57", are not numeric strings, they are string constants. The idea of 'numeric string' only applies to fields, getline input, FILENAME, ARGV elements, ENVIRON elements and the elements of an array created by split() or patsplit() that are numeric strings. The basic idea is that user input, and only user input, that looks numeric, should be treated that way.
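The difference between numeric strings from input and string constants shows up when the same characters are compared in both forms. The sketch below assumes a POSIX-compatible awk on PATH; the labels "numeric" and "string" are inferred from the result of each comparison and are only illustrative.

```shell
out=$(printf '10 9\n' | awk '{
    # $1 and $2 come from input and look numeric: 10 > 9 is true.
    fields = ($1 > $2) ? "numeric" : "string"
    # String constants compare as strings: "10" sorts before "9".
    consts = ("10" > "9") ? "numeric" : "string"
    print "fields: " fields
    print "constants: " consts
}')
printf '%s\n' "$out"
# fields: numeric
# constants: string
```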

Octal and Hexadecimal Constants

You may use C-style octal and hexadecimal constants in your AWK program source code. For example, the octal value 011 is equal to decimal 9, and the hexadecimal value 0x11 is equal to decimal 17.

String Constants

String constants in AWK are sequences of characters enclosed between double quotes (like "value"). Within strings, certain escape sequences are recognized, as in C. These are:

\\ A literal backslash.

\a The 'alert' character; usually the ASCII BEL character.

\b Backspace.

\f Form-feed.

\n Newline.

\r Carriage return.

\t Horizontal tab.

\v Vertical tab.

\xhex digits The character represented by the string of hexadecimal digits following the \x. Up to two following hexadecimal digits are considered part of the escape sequence. E.g., "\x1B" is the ASCII ESC (escape) character.

\ddd The character represented by the 1-, 2-, or 3-digit sequence of octal digits. E.g., "\033" is the ASCII ESC (escape) character.

\c The literal character c.

In compatibility mode, the characters represented by octal and hexadecimal escape sequences are treated literally when used in regular expression constants. Thus, /a\52b/ is equivalent to /a\*b/.

Regexp Constants

A regular expression constant is a sequence of characters enclosed between forward slashes (like /value/). Regular expression matching is described more fully below; see Regular Expressions.

The escape sequences described earlier may also be used inside constant regular expressions (e.g., /[ \t\f\n\r\v]/ matches whitespace characters).

Gawk provides strongly typed regular expression constants. These are written with a leading @ symbol (like so: @/value/). Such constants may be assigned to scalars (variables, array elements) and passed to user-defined functions. Variables that have been so assigned have regular expression type.