Programming with awk

Strings and string functions

A string constant is created by enclosing a sequence of characters inside quotation marks, as in ``"abc"'' or ``"hello, everyone"''. String constants may contain the C programming language escape sequences for special characters listed in ``Extended regular expressions''.

String expressions are created by concatenating constants, variables, field names, array elements, functions, and other expressions. The program

   { print NR ":" $0 }

prints each record preceded by its record number and a colon, with no blanks. The three strings representing the record number, the colon, and the record are concatenated and the resulting string is printed. The concatenation operator has no explicit representation other than juxtaposition.
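
As a small sketch of the same idea, the record number (a number) is converted to a string when it is juxtaposed with other strings, and the result can be stored and measured like any other string. The shell wrapper, the two sample input lines, and the variable names line and result are assumptions made for illustration only.

```shell
# Feed two sample records to awk and keep the output in a shell variable.
result=$(printf 'alpha\nbeta\n' | awk '{
    line = NR ":" $0            # juxtaposition concatenates the three strings
    print line, length(line)    # the concatenated string and its length
}')
echo "$result"
```

Each output line shows the concatenated string followed by its length, with no blanks inside the string itself.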

awk provides the built-in string functions shown in ``awk built-in string functions''. In this table, r represents an extended regular expression (either as a string or as /r/), s and t string expressions, and n and p integers.

awk built-in string functions

Function Description

gsub(r,s) substitute s for r globally in current record, return number of substitutions

gsub(r,s,t) substitute s for r globally in string t, return number of substitutions

index(s,t) return position of string t in s, 0 if not present

length(s) return length of s

match(s,r) return the position in s where r occurs, 0 if not present

split(s,a) split s into array a on FS, return number of fields

split(s,a,r) split s into array a on r, return number of fields

sprintf(fmt,expr-list) return expr-list formatted according to format string fmt

sub(r,s) substitute s for first r in current record, return number of substitutions

sub(r,s,t) substitute s for first r in t, return number of substitutions

substr(s,p) return substring of s starting at position p

substr(s,p,n) return substring of s of length n starting at position p

tolower(s) return a string in which each upper case character in string s is replaced by a lower case character

toupper(s) return a string in which each lower case character in string s is replaced by an upper case character
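
The functions split, tolower, and toupper are not discussed further in this section, so a brief sketch may help. The date-like sample string, the array name part, and the shell variable result are illustrative assumptions, not taken from the text.

```shell
result=$(awk 'BEGIN {
    n = split("2024-01-15", part, "-")    # split on "-" instead of FS
    print n, part[1], part[2], part[3]    # number of fields, then each field
    print toupper("awk"), tolower("AWK")  # case conversion returns a new string
}')
echo "$result"
```

Note that split numbers the array elements from 1, and that tolower and toupper leave their argument unchanged.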

The functions sub and gsub are patterned after the substitute command in the text editor ed(C). The function gsub(r,s,t) replaces successive occurrences of substrings matched by the extended regular expression r with the replacement string s in the target string t. (As in ed, the leftmost match is used, and is made as long as possible.) It returns the number of substitutions made. The function gsub(r,s) is a synonym for gsub(r,s,$0). For example, the program

   { gsub(/USA/, "United States"); print }

transcribes its input, replacing occurrences of USA by United States. The sub functions are similar, except that they only replace the first matching substring in the target string.
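
The difference between the two functions, and their return values, can be sketched with the three-argument forms. The sample string and the variable names t, u, n, m, and result are assumptions for illustration.

```shell
result=$(awk 'BEGIN {
    t = "USA and USA"
    u = t
    n = gsub(/USA/, "United States", t)   # replaces every match in t
    m = sub(/USA/, "United States", u)    # replaces only the first match in u
    print n, t
    print m, u
}')
echo "$result"
```

Both functions modify the target string in place; the value they return is the count of substitutions, not the new string.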

The function index(s,t) returns the leftmost position where the string t begins in s, or zero if t does not occur in s. The first character in a string is at position 1. For example,

   index("banana", "an")

returns 2.
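
A common use of index is to locate a delimiter and then cut the string around it with substr. The following is a minimal sketch; the sample string "key=value" and the variable names s, p, and result are hypothetical.

```shell
result=$(awk 'BEGIN {
    print index("banana", "an")    # leftmost occurrence
    print index("banana", "x")     # 0 because "x" does not occur
    s = "key=value"
    p = index(s, "=")              # position of the delimiter
    print substr(s, 1, p - 1), substr(s, p + 1)
}')
echo "$result"
```

Because positions start at 1, a return value of 0 can be tested directly as "not found".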

The length function returns the number of characters in its argument string; thus,

   { print length($0), $0 }

prints each record, preceded by its length. ($0 does not include the input record separator.) The program

   length($1) > max  { max = length($1); name = $1 }
   END               { print name }

when applied to the file countries, prints the longest country name: Australia.
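
Two names in countries, Australia and Argentina, have the same length, so the choice of comparison operator decides which one is reported. The sketch below feeds only the country names (taken from the sample output in this section) to awk; the variable names max1, max2, first, last, and result are assumptions.

```shell
result=$(printf '%s\n' USSR Canada China USA Brazil Australia India Argentina Sudan Algeria | awk '
    length($1) >  max1 { max1 = length($1); first = $1 }   # keeps the first longest name
    length($1) >= max2 { max2 = length($1); last  = $1 }   # keeps the last longest name
    END                { print first, last }')
echo "$result"
```

With the strict comparison the earlier name wins a tie; with >= the later one does.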

The match(s,r) function returns the position in string s where extended regular expression r occurs, or 0 if it does not occur. This function also sets two built-in variables RSTART and RLENGTH. RSTART is set to the starting position of the match in the string; this is the same value as the returned value. RLENGTH is set to the length of the matched string. (If a match does not occur, RSTART is 0, and RLENGTH is -1.) For example, the following program finds the first occurrence of the letter i followed by at most one character followed by the letter a in a record:

   { if (match($0, /i.?a/))
         print RSTART, RLENGTH, $0 }

It produces the following output on the file countries:

   17 2 USSR       8650    262     Asia
   26 3 Canada     3852     24     North America
    3 3 China      3692    866     Asia
   24 3 USA        3615    219     North America
   27 3 Brazil     3286    116     South America
    8 2 Australia  2968     14     Australia
    4 2 India      1269    637     Asia
    7 3 Argentina  1072     26     South America
   17 3 Sudan       968     19     Africa
    6 2 Algeria     920     18     Africa

NOTE: match matches the leftmost longest matching string. For example, with the record

   AsiaaaAsiaaaaan

as input, the program

   { if (match($0, /a+/)) print RSTART, RLENGTH, $0 }

matches the first string of a's and sets RSTART to 4 and RLENGTH to 3.
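
Since RSTART and RLENGTH describe the match, they can be passed to substr to extract the matched text itself. The sketch below reuses the record from the note and also shows the values set by a failed match; the pattern /z/ and the variable name result are assumptions for illustration.

```shell
result=$(echo "AsiaaaAsiaaaaan" | awk '{
    if (match($0, /a+/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)   # the matched text
    if (!match($0, /z/))
        print RSTART, RLENGTH                                # values after a failed match
}')
echo "$result"
```

Each call to match overwrites RSTART and RLENGTH, so they should be used or saved before the next call.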

The function sprintf(format, expr[1], expr[2], . . ., expr[n]) returns (without printing) a string containing expr[1], expr[2], . . ., expr[n] formatted according to the printf specifications in the string format. ``The printf statement'' contains a complete specification of the format conventions. The statement

   x = sprintf("%10s %6d", $1, $2)

assigns to x the string produced by formatting the value of $1 as a string in a field at least ten characters wide and the value of $2 as a decimal number in a field at least six characters wide; x may be used in any subsequent computation.
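
To make the field widths visible, the following sketch applies the same format to constant values in place of $1 and $2 and prints the result between brackets. The values "Canada" and 3852 come from the sample data; the brackets and the variable name result are assumptions.

```shell
result=$(awk 'BEGIN {
    x = sprintf("%10s %6d", "Canada", 3852)   # nothing is printed by sprintf itself
    print "[" x "]"                           # brackets show the padding
    print length(x)                           # 10 + 1 + 6 characters
}')
echo "$result"
```

Both fields are right-justified and padded with blanks on the left; a value wider than its field is not truncated.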

The function substr(s,p,n) returns the substring of s that begins at position p and is at most n characters long. If substr(s,p) is used, the substring goes to the end of s; that is, it consists of the suffix of s beginning at position p. For example, we could abbreviate the country names in countries to their first three characters by invoking the program

   { $1 = substr($1, 1, 3); print }

on this file to produce

   USS 8650 262 Asia
   Can 3852 24 North America
   Chi 3692 866 Asia
   USA 3615 219 North America
   Bra 3286 116 South America
   Aus 2968 14 Australia
   Ind 1269 637 Asia
   Arg 1072 26 South America
   Sud 968 19 Africa
   Alg 920 18 Africa

Note that setting $1 in the program forces awk to recompute $0 and, therefore, the fields are separated by blanks (the default value of OFS), not by tabs.
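
If the tabs should be preserved, one possible approach is to set both FS and OFS to a tab before any record is read. This is a sketch that assumes the fields of countries are separated by single tabs; it is shown on one sample record, and the variable name result is an assumption.

```shell
result=$(printf 'Canada\t3852\t24\tNorth America\n' | awk '
    BEGIN { FS = OFS = "\t" }                 # split and rejoin on tabs
    { $1 = substr($1, 1, 3); print }')        # $0 is rebuilt using OFS
echo "$result"
```

Setting FS to a tab also keeps "North America" together as a single field, where the default FS would split it at the blank.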

Strings are stuck together (concatenated) merely by writing them one after another in an expression. For example, when invoked on the file countries,

        { s = s substr($1, 1, 3) " " }
   END  { print s }

prints

   USS Can Chi USA Bra Aus Ind Arg Sud Alg

by building s up a piece at a time from an initially empty string.
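
The program above leaves a trailing blank at the end of s. A common variation, sketched here under the assumption that the blank is unwanted, places the separator before each piece and leaves it empty for the first record. The variable names sep and result, the three sample names, and the brackets in the output are illustrative.

```shell
result=$(printf '%s\n' USSR Canada China | awk '
        { s = s sep substr($1, 1, 3); sep = " " }   # sep is empty for the first record
    END { print "[" s "]" }')                       # brackets show there is no trailing blank
echo "$result"
```

This relies on an uninitialized variable having the empty string as its string value, just as s does in the original program.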