Using awk

Using strings and string functions

A string constant is created by enclosing a sequence of characters inside double quotation marks, as in "abc" or "hello, everyone". String constants can contain the C programming language escape sequences for special characters listed in ``Regular expressions''.

String expressions are created by concatenating constants, variables, field names, array elements, functions, and other expressions. The following program prints each record preceded by its record number and a colon, with no blanks:

   { print NR ":" $0 }

This concatenates the three strings representing the record number, the colon, and the record, and prints the resulting string.

awk provides the built-in string functions shown in ``awk string functions''. In this table, r represents a regular expression, s and t are string expressions, a is an array name, fmt is a format string, and n and p are integers.

awk string functions

Function Description

getline reads next line of input

gsub(r,s) substitutes s for r globally in current record, returns number of substitutions

gsub(r,s,t) substitutes s for r globally in string t, returns number of substitutions

index(s,t) returns position of string t in s, 0 if not present

length(s) returns length of s

match(s,r) returns the position in s where r occurs, 0 if not present; see built-in variables RSTART and RLENGTH

split(s,a) splits s into array a on FS, returns number of fields

split(s,a,r) splits s into array a on r, returns number of fields

sprintf(fmt,expr-list) returns expr-list formatted according to format string fmt

sub(r,s) substitutes s for first r in current record, returns number of substitutions

sub(r,s,t) substitutes s for first r in t, returns number of substitutions

substr(s,p) returns suffix of s starting at position p

substr(s,p,n) returns substring of s of length n starting at position p

tolower(s) returns s translated into lowercase

toupper(s) returns s translated into uppercase
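The functions split, tolower, and toupper appear in the table but are not illustrated elsewhere in this section. The following is a minimal sketch of how they might be used; the colon-separated sample string, the array name part, and the shell variable out are assumptions made for this example only.

```shell
# Sketch: split a string on an explicit separator, then change case.
out=$(awk 'BEGIN {
    n = split("North America:24:3852", part, ":")   # n is the number of pieces
    print n, part[1], part[3]
    print toupper(part[1]), tolower(part[1])
}')
printf '%s\n' "$out"
```

This should print ``3 North America 3852'' followed by ``NORTH AMERICA north america''. Array elements produced by split are numbered from 1.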

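The getline function reads the next input record. Note that its syntax is like that of a statement rather than an ordinary function call: appending parentheses to it causes an error. Used on its own, getline replaces $0 (and updates NF and NR) with the new record; it yields 1 if a record was read, 0 at end of input, and -1 on error.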

   { print "skipping record for", $1
     getline
     print "going to record for", $1 }

This action prints the first message for the current record, then executes getline, which replaces the current record with the next one without restarting the pattern-action cycle, and finally prints the second message for the new record. The output begins:

   skipping record for CIS
   going to record for Canada
   skipping record for China
   ...

For more information on getline, see ``Multiline records and the getline function''.
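As a hedged illustration of the behavior described above, the following sketch runs the same action on four one-field records supplied from the shell. The input data and the shell variable out are assumptions made for this example; they stand in for the first fields of the file countries.

```shell
# Sketch: each pass through the action consumes two records.
out=$(printf 'CIS\nCanada\nChina\nUSA\n' | awk '
{
    print "skipping record for", $1   # message for the current record
    getline                           # replace $0 with the next record
    print "going to record for", $1   # $1 now comes from the new record
}')
printf '%s\n' "$out"
```

With an even number of records, as here, the messages alternate cleanly. With an odd number, the final getline finds no more input and leaves $0 unchanged, so the last record is named in both messages.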

The functions sub and gsub are patterned after the substitute command in the text editor ed(C). The function gsub(r,s,t) replaces successive occurrences of substrings matched by the regular expression r with the replacement string s in the target string t. (As in ed, the left-most match is used and is made as long as possible.) gsub returns the number of substitutions made. The function gsub(r,s) is a synonym for gsub(r,s,$0). For example, the following program transcribes its input, replacing occurrences of ``USA'' with ``United States'':

   { gsub(/USA/, "United States"); print }

Note that combining the two commands in this action, so that print is applied to the gsub call itself, has an unexpected effect:

   { print gsub(/USA/, "United States",$0) }

This prints the value returned by gsub for each record, that is, the number of substitutions made, instead of the modified record. In this case, only the fourth record of countries contains the string ``USA'', so the program prints 1 for that record and 0 for all the others.

The sub functions are similar to gsub, except that they only replace the first matching substring in the target string.
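The difference between sub and gsub, and the values they return, can be seen in the following sketch. The target string "banana", the awk variable names, and the shell variable out are assumptions chosen for illustration.

```shell
# Sketch: gsub replaces every match in the target, sub only the first.
out=$(awk 'BEGIN {
    t = "banana"
    n = gsub(/an/, "AN", t)     # n counts the substitutions made in t
    print n, t
    u = "banana"
    m = sub(/an/, "AN", u)      # at most one substitution is made in u
    print m, u
}')
printf '%s\n' "$out"
```

This should print ``2 bANANa'' and then ``1 bANana''. Both functions modify the target variable in place; the function value is only the count.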

The function index(s,t) returns the left-most position where the string t begins in s, or zero if t does not occur in s. The first character in a string is at position 1. For example, the following command returns 2:

   { print index("banana", "an") }
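A few further cases of index are sketched below, including the zero result for a missing string. The sample strings and the shell variable out are assumptions made for this example. Note that the second argument is taken as a plain string, not as a regular expression, so a period matches only a period.

```shell
# Sketch: index searches for a literal string and counts positions from 1.
out=$(awk 'BEGIN {
    print index("banana", "na")    # left-most "na" begins at position 3
    print index("banana", "x")     # not present, so the result is 0
    print index("a.b", ".")        # the period is literal: position 2
}')
printf '%s\n' "$out"
```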

The length function returns the number of characters in its argument string; thus, the following prints each record, preceded by its length:

   { print length($0), $0 }

($0 does not include the input record separator, so the newline that ends each record is not counted.) The following program prints the longest country name (``Australia''):

   length($1) > max  { max = length($1); name = $1 }
   END               { print name }
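As a quick check of how length counts characters, the following sketch measures one record and its first field. The one-line input and the shell variable out are assumptions made for this example.

```shell
# Sketch: the blank inside the record is counted; the ending newline is not.
out=$(printf 'North America\n' | awk '{ print length($0), length($1) }')
printf '%s\n' "$out"
```

This should print ``13 5'': thirteen characters in the whole record and five in the first field.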

The match(s,r) function returns the position in string s where regular expression r occurs, or 0 if it does not occur. This function also sets two built-in variables RSTART and RLENGTH. RSTART is set to the starting position of the match in the string; this is the same value as the returned value. RLENGTH is set to the length of the matched string. (If a match does not occur, RSTART is 0, and RLENGTH is -1.) For example, the following program finds the first occurrence of the letter ``i,'' followed by at most one character, followed by the letter ``a'' in a record:

   { if (match($0, /i.?a/))
         print RSTART, RLENGTH, $0 }

This program produces the following output from the file countries:

   16	2	CIS		8650	262	Asia
   26	3	Canada		3852	24	North America
   3	3	China		3692	866	Asia
   24	3	USA		3615	219	North America
   27	3	Brazil		3286	116	South America
   8	2	Australia	2968	14	Australia
   4	2	India		1269	637	Asia
   7	3	Argentina	1072	26	South America
   17	3	Sudan		968	19	Africa
   6	2	Algeria		920	18	Africa

Note that the match function matches the left-most longest matching string. For example, if you use the string ``AsiaaaAsiaaaaan'' as an input record, the following program matches the first string of a's and sets RSTART to 4 and RLENGTH to 3:

   { if (match($0, /a+/))  print RSTART, RLENGTH, $0 }
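Because RSTART and RLENGTH describe the matched text, they can be passed to substr to extract it. The following sketch does this for the record discussed above and for a second record with no match; the second record "xyz" and the shell variable out are assumptions added for illustration.

```shell
# Sketch: extract the matched text, and show the values left by a failed match.
out=$(printf 'AsiaaaAsiaaaaan\nxyz\n' | awk '
{
    if (match($0, /a+/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    else
        print RSTART, RLENGTH, "no match"
}')
printf '%s\n' "$out"
```

This should print ``4 3 aaa'' for the first record and ``0 -1 no match'' for the second.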

The function sprintf(format, expr1, expr2, ..., exprn) returns (without printing) a string containing expr1, expr2, ..., exprn formatted according to the printf specifications in the string format.

For a complete specification of these format conventions, see ``The printf statement''.

The following statement assigns to x the string produced by formatting the values of $1 and $2:

   x = sprintf("%10s %6d", $1, $2)

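The string consists of $1 right-justified in a field at least 10 characters wide, followed by a blank and $2 as a decimal number in a field of width at least six; x can be used in any subsequent computation or display operation. For example, the following program uses the same format without the separating blank: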

   { x=sprintf("%10s%6d",$1,$2); print x }

This program produces the following output:

          CIS  8650
       Canada  3852
        China  3692
          USA  3615
       Brazil  3286
    Australia  2968
        India  1269
    Argentina  1072
        Sudan   968
      Algeria   920
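The following sketch compares the format string used above with a left-justified variant. The two sample records, the variable names right and left, the square brackets added to make the field edges visible, and the shell variable out are assumptions made for this example.

```shell
# Sketch: "%10s" pads on the left; "%-10s" pads on the right.
out=$(printf 'CIS\t8650\nSudan\t968\n' | awk '
{
    right = sprintf("%10s%6d", $1, $2)     # same format string as above
    left  = sprintf("%-10s%6d", $1, $2)    # "-" left-justifies the name
    print "[" right "]"
    print "[" left "]"
}')
printf '%s\n' "$out"
```

Each string is 16 characters long between the brackets: 10 for the name and 6 for the number.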

The function substr(s,p,n) returns the substring of s that begins at position p and is at most n characters long. If substr(s,p) is used, the substring goes to the end of s; that is, it consists of the suffix of s beginning at position p. For example, we could abbreviate the country names in countries to their first three characters by invoking the following program:

   { $1 = substr($1, 1, 3); print }

This produces the following output:

   CIS 8650 262 Asia
   Can 3852 24 North America
   Chi 3692 866 Asia
   USA 3615 219 North America
   Bra 3286 116 South America
   Aus 2968 14 Australia
   Ind 1269 637 Asia
   Arg 1072 26 South America
   Sud 968 19 Africa
   Alg 920 18 Africa

Note that setting $1 in the program forces awk to recompute $0 and, therefore, the fields are separated by blanks (the default value of OFS), not by tabs. Attempting to change the setting of OFS back to a tab character with the command { OFS="\t" } has the following result (only the first two lines are shown):

   CIS     8650    262     Asia
   Can     3852    24      North America

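Note that this has the undesirable effect of tab-separating ``North'' and ``America'' as well as the genuine fields: the default field splitting treats the blank between the two words as a separator, so they were read as two separate fields.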
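One possible way to avoid this, sketched below, is to set both FS and OFS to a tab before any input is read, so that fields are split and rejoined on tabs only. This assumes the input really is tab-separated, as the discussion of countries suggests; the single sample record and the shell variable out are assumptions made for this example.

```shell
# Sketch: with FS and OFS both set to a tab, "North America" stays one field.
out=$(printf 'Canada\t3852\t24\tNorth America\n' | awk '
BEGIN { FS = OFS = "\t" }              # split and rejoin on tabs only
{ $1 = substr($1, 1, 3); print }')
printf '%s\n' "$out"
```

The output record keeps its four tab-separated fields, with the blank inside the last field preserved.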

Strings are stuck together (concatenated) by writing them one after another in an expression. For example, consider the following program:

        { s = s substr($1, 1, 3) " " }
   END  { print s }

When invoked on the file countries, the program prints the following by building s up, one piece at a time, from an initially empty string:

   CIS Can Chi USA Bra Aus Ind Arg Sud Alg
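Because the program above appends a blank after every piece, the string it builds ends with a blank. A common variation, sketched here, places the separator before each piece except the first. The variable name sep, the three sample records, and the shell variable out are assumptions made for this example.

```shell
# Sketch: sep is empty for the first piece and a blank for all later ones.
out=$(printf 'CIS\nCanada\nChina\n' | awk '
    { s = s sep substr($1, 1, 3); sep = " " }
END { print s }')
printf '%s\n' "$out"
```

This should print ``CIS Can Chi'' with no trailing blank. It relies on the fact that an uninitialized awk variable behaves as the empty string in a concatenation.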