Ch 4 -- Awk
UNIX Unleashed, Internet Edition
- 4 -
Awk
By Ann Marshall and David B. Horvath
The UNIX utility awk is a pattern-matching and processing language with considerably
more power than you might realize. It searches one or more specified files, checking
for records that match a specified pattern. If awk finds a match, the corresponding
action is performed. Awk is a simple concept, but it is a powerful tool. Often, an
awk program is only a few lines long, and because of this, an awk program is often
written, used, and discarded. A traditional programming language, such as Pascal
or C, would take more thought, more lines of code, and hence, more time.
Short awk programs arise from two of awk's built-in features: the amount of predefined
flexibility and the number of details automatically handled by the language. Together,
these features allow the manipulation of large data files in short (often single-line)
programs and make awk stand apart from other programming languages. Certainly, any
time you spend learning awk will pay dividends in improved productivity and efficiency.
Uses
The uses for awk vary from the simple to the complex. Originally, awk was intended
for various kinds of data manipulation. Intentionally omitting parts of a file, counting
occurrences in a file, and writing reports are natural uses for awk.
Awk uses the syntax of the C programming language; so if you know C, you have
an idea of awk syntax. If you are new to programming or don't know C, learning awk
will familiarize you with many of the C constructs.
Examples of where awk can be helpful abound. Computer-aided manufacturing, for
example, is plagued with nonstandardization: no standard format yet exists for the
programs that run the machines. The output from computer A running machine A is
therefore probably not the input needed by computer B running machine B. Although
machine A is finished with the material, machine B is not ready to accept it, and
production halts while someone edits the file to meet computer B's needed format.
Rather than writing a complex C program, you can handle this kind of simple data
transformation with a short awk program.
Due to the amount of built-in automation within awk, it is also useful for rapid
prototyping or trying out an idea that could later be implemented in another language.
Awk works with text files, not binary files. Because binary data can contain values
that look like record terminators (newline characters)--or not have any at the end
of the record--awk will get confused. If you need to process binary files, look into
Perl or use a traditional programming language such as C.
Features
Reflecting the UNIX environment, awk features resemble the structures of both
C and shell scripts. Highlights include flexibility, predefined variables, automation,
standard program constructs, conventional variable types, powerful output formatting
borrowed from C, and ease of use.
The flexibility means that most tasks may be done more than one way in awk. With
the application in mind, the programmer chooses which method to use. The built-in
variables already provide many of the tools to do what is needed. Awk is highly automated.
For instance, awk automatically retrieves each record, separates it into fields,
and does type conversion when needed, without programmer's request. Furthermore,
there are no variable declarations. Awk includes the usual programming constructs
for the control of program flow: an if statement for two-way decisions and
do, for, and while statements for looping. Awk also includes
its own notational shorthand to ease typing. (This is UNIX after all!) Awk borrows
the printf() statement from C to allow "pretty" and versatile
formats for output. These features combine to make awk user-friendly.
A Brief History
Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan created awk in 1977.
The name comes from the initials of the creators' last names. They worked at Bell
Labs, where the UNIX operating system and the C programming language were also
created. You will see many similarities between awk and C, largely for that reason.
In 1985, more features were added, creating nawk (new awk). For quite a while,
nawk remained exclusively the property of AT&T Bell Labs. Although it became
part of System V for Release 3.1, some versions of UNIX, such as SunOS, keep both
awk and nawk due to a syntax incompatibility. Others, such as System V, run nawk
under the name awk (although System V has nawk too). The Free Software Foundation's
GNU project introduced its own version of awk, gawk, based on the IEEE POSIX awk
standard (Institute of Electrical and Electronics Engineers, Inc., IEEE Standard for
Information Technology, Portable Operating System Interface, Part 2: Shell and
Utilities, Volume 2, ANSI approved 4/5/93), which differs from both awk and nawk.
Linux systems typically use gawk rather than awk or nawk. Throughout this chapter,
the word awk is used when any of the three (new awk, POSIX awk, or gawk) will do.
The versions are mostly upwardly compatible. Awk is the oldest, then nawk, then
POSIX awk, and then gawk, as shown in Figure 4.1. I have used the notation version++
to denote a concept that began in that version and continues through any later versions.
NOTE: Due to different syntax, not all
code written in the original awk language will run under nawk, POSIX awk, or gawk.
However, except when noted, all the concepts of awk are implemented in nawk and gawk.
Where it matters, the version is specified. If an example does not work using the
awk command, try nawk.
Figure 4.1.
The evolution of awk.
Refer to the end of the chapter for more information and further resources on
awk and its derivatives.
Fundamentals
This section introduces the basics of the awk programming language. One feature
of awk that almost continually holds true is this: You can do most tasks more than
one way. The command line exemplifies this. First, I explain the variety of ways
awk can be called from the command line--using files for input, the program file,
and possibly an output file. Next, I introduce the main construct of awk, which is
the pattern action statement. Then, I explain the fundamental ways awk can read and
transform input. I conclude the section with a look at the format of an awk program.
Entering Awk from the Command Line
In its simplest form, awk takes the material you want to process from standard
input and displays the results to standard output (the monitor). You write the awk
program on the command line.
You can either specify explicit awk statements on the command line, or, with the
-f flag, specify an awk program file that contains a series of awk
commands. In addition to the standard UNIX design allowing for standard input and
output, you can, of course, use file redirection in your shell, too; so awk 'program'
< inputfile is functionally identical to awk 'program' inputfile. To save the output
in a file, the file redirection awk 'program' > outputfile does the trick. Awk
can work with multiple input files at once if they are specified on the command line.
The most common way to use awk is as part of a command pipe, where it filters
the output of a command. An example is ls -l | awk '{print $3}', which
prints just the third column of each line of the ls -l output. Awk scripts
can become quite complex, so if you have a standard set of filter rules that you
would like to apply to a file, with the output sent directly to the printer, you
could use something like awk -f myawkscript inputfile | lp.
TIP: To specify your awk script on the
command line, it is best to use single quotes to let you embed spaces and to ensure
that the command shell does not interpret any special characters in the awk script.
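As a rough sketch of the invocation styles described above, the following commands run the same one-line program three ways: with input from a pipe, with a named input file, and with the program stored in a file and loaded with -f. The file names input.txt and swap.awk are made up for this illustration.

```shell
# Work in a throwaway directory; input.txt and swap.awk are hypothetical names.
dir=$(mktemp -d)
printf 'alpha 1\nbeta 2\n' > "$dir/input.txt"
printf '{ print $2, $1 }\n' > "$dir/swap.awk"

# 1. Program on the command line, input arriving through a pipe.
from_pipe=$(cat "$dir/input.txt" | awk '{ print $2, $1 }')

# 2. Program on the command line, input file named as an argument.
from_file=$(awk '{ print $2, $1 }' "$dir/input.txt")

# 3. Program stored in a file and loaded with -f.
from_script=$(awk -f "$dir/swap.awk" "$dir/input.txt")

echo "$from_script"
rm -r "$dir"
```

All three forms should produce the same two lines, with the fields of each input line swapped.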
Files for Input
Input and output places can be changed. You can specify an input file by typing
the name of the file after the program with a blank space between the two. If you
name no input file, awk reads its input from your workstation keyboard (standard input).
To signal the end of the input, type Ctrl-D. The program on the command line
executes on the input you just entered and the results are displayed on the
monitor (the standard output).
Here's a simple little awk command that echoes all lines I type, prefacing each
with the number of words (or fields, in awk parlance, hence the NF variable
for number of fields) in the line.
Note that Ctrl-D means that while holding down the Control key, you should press
the D key.
$ awk '{print NF ": " $0}'
I am testing my typing.
A quick brown fox jumps when vexed by lazy ducks.
Ctrl+D
5: I am testing my typing.
10: A quick brown fox jumps when vexed by lazy ducks.
$ _
You can also name more than one input file on the command line, causing the combined
files to act as one input. Naming the same file more than once is one way of making
multiple runs through one input file.
TIP: Keep in mind that the correct ordering
on the command line is crucial for your program to work correctly; files are read
from left to right, so if you want to have file1 and file2 read
in that order, you'll need to specify them as such on the command line.
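To see the effect of file order, here is a small sketch that numbers the records of two made-up files, file1 and file2, with the built-in NR variable. The numbering runs across both files as if they were one input, so swapping the file names changes which lines come first.

```shell
# file1 and file2 are hypothetical sample files created for this sketch.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/file1"
printf 'c\n'    > "$dir/file2"

# NR counts records across all input files, in command-line order.
forward=$(awk '{ print NR ": " $0 }' "$dir/file1" "$dir/file2")
reverse=$(awk '{ print NR ": " $0 }' "$dir/file2" "$dir/file1")

echo "$forward"
echo "$reverse"
rm -r "$dir"
```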
The Program File
With awk's automatic type conversion, a file of names and a file of numbers entered
in the reverse order at the command line generate strange-looking output rather than
an error message. A long program typed on the command line is just as easy to get
wrong, so for longer programs, it is simpler to put the program
in a file and specify the name of the file on the command line. The -f option
does this. Notice that the -f option takes the name of the program file as its
argument, and any input files follow it as the last parameters on the command line.
NOTE: Versions of awk that meet the POSIX
awk specifications are allowed to have multiple -f options. The program files are
read in the order given and combined into a single program that runs over the input.
Specifying Output on the Command Line
Output from awk may be redirected to a file or piped to another program. (See
Chapter 4, Volume I, "The UNIX File System.") The command awk '/^5/
{print $0}' | grep 3, for example, will result in just those lines that start
with the digit 5 (that's what the awk part does) and also contain the digit
3 (the grep command). If you wanted to save that output to a file,
by contrast, you could use awk '/^5/ {print $0}' > results and the file
results would contain all lines that begin with the digit 5. If you
opt for neither of these courses, the output of awk will be displayed on your screen
directly, which can be quite useful in many instances, particularly when you're developing
or fine-tuning your awk script.
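The two output choices can be tried on a few made-up lines. This sketch first pipes awk's output into grep and then redirects the same awk output to a temporary file; the sample text and the use of mktemp are assumptions for the illustration.

```shell
# Four hypothetical input lines.
sample=$(printf '53 apples\n57 pears\n35 plums\n5 and 3\n')

# Pipe: awk keeps lines starting with 5, grep keeps those that also contain 3.
piped=$(printf '%s\n' "$sample" | awk '/^5/ { print $0 }' | grep 3)

# Redirect: save every line starting with 5 in a file.
results=$(mktemp)
printf '%s\n' "$sample" | awk '/^5/ { print $0 }' > "$results"
saved=$(cat "$results")
rm "$results"

echo "$piped"
```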
Patterns and Actions
Awk programs are divided into three main blocks: the BEGIN block, the
per-record processing block, and the END block. Unless explicitly stated,
all statements to awk appear in the per-record block. (You'll see later where
the other blocks can come in particularly handy for programming, though.)
Statements within awk are divided into two parts: a pattern, telling awk what
to match, and a corresponding action, telling awk what to do when a line matching
the pattern is found. The action part of a pattern-action statement is enclosed in
curly braces ({}) and can be multiple statements. Either part of a pattern
action statement may be omitted. An action with no specified pattern matches every
record of the input file you want to search. (That's how the earlier example of {print
$0} worked.) A pattern without an action indicates that you want each matching input
record to be copied to the output as it is (that is, printed).
/^5/ {print $0} is an example of a two-part statement. The pattern is
all lines that begin with the digit 5. (The ^ indicates that it
should appear at the beginning of the line; without this modifier, the pattern would
say any line that includes the digit 5.) The action prints the entire line,
verbatim. ($0 is shorthand for the entire line.)
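A short sketch may help make the three statement shapes concrete: a pattern alone, an action alone, and both together. The three input lines are invented for the illustration.

```shell
# Hypothetical three-line input.
input=$(printf '5 first\n15 second\n52 third\n')

# Pattern only: matching records are printed unchanged.
pattern_only=$(printf '%s\n' "$input" | awk '/^5/')

# Action only: the action runs for every record.
action_only=$(printf '%s\n' "$input" | awk '{ print $2 }')

# Pattern and action: the action runs only for matching records.
both=$(printf '%s\n' "$input" | awk '/^5/ { print $2 }')

echo "$both"
```

Note that 15 second contains a 5 but does not begin with one, so the anchored pattern skips it.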
Input
Awk automatically scans, in order, each record of the input file, testing it against each
pattern action statement in the awk program. Unless otherwise set, awk assumes each
record is a single line. (See the sections "Advanced Concepts," "Multiline
Records" in this chapter for how to change this.) If the input file has blank
lines in it, the blank lines count as a record too. Awk automatically retrieves each
record for analysis; you do not need a read statement.
A programmer can also disrupt the automatic input order in one of two ways: with the
next and exit statements. The next statement tells awk
to retrieve the next record from the input file and continue, without running the
current input record, through the remaining portion of pattern-action statements
in the program. For example, if you are doing a crossword puzzle and all the letters
of a word are formed by previous words, most likely you wouldn't even bother to read
that clue but simply skip to the clue below; this is how the next statement
would work, if your list of clues were the input. The other method of disrupting
the usual flow of input is through the exit statement. The exit
statement transfers control to the END block--if one is specified--or quits
the program, as if all the input has been read. Suppose the arrival of a friend ends
your interest in the crossword puzzle, but you still put the paper away. Within the
END block, an exit statement causes the program to quit.
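The following sketch shows both statements at work on a made-up input. Lines beginning with # are skipped with next, and a line beginning with STOP ends the input early with exit, after which the END block still runs. The STOP marker and the count variable are assumptions chosen for the example.

```shell
out=$(printf 'keep 1\n# comment\nkeep 2\nSTOP\nkeep 3\n' | awk '
/^#/    { next }              # skip this record; later rules never see it
/^STOP/ { exit }              # stop reading input; control passes to END
        { count++; print }    # reached only by records not skipped above
END     { print count " records printed" }
')
echo "$out"
```

The record keep 3 is never read, because exit fires on the line before it.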
An input record refers to the entire line of a file including any characters,
spaces, or tabs. The spaces and tabs are called whitespace.
TIP: If you think that your input file
includes both spaces and tabs, you can save yourself a lot of confusion by ensuring
that all tabs become spaces with the expand command. It works like this:
expand filename | awk '{ stuff }'. If your system does not have expand,
you can use pr -e.
The whitespace in the input file and the whitespace in the output file are not
related; you must explicitly put whitespace in your output file.
Fields
A group of characters in the input record or output file is called a field.
Fields are predefined in awk: $1 is the first field, $2 is the
second, $3 is the third, and so on. $0 indicates the entire line.
Fields are separated by a field separator (any single character including Tab)
held in the variable FS. Unless you change it, FS has a space
as its value, which awk treats specially: fields are then separated by any run of
spaces or tabs. You can change FS by either starting the program file with
the following statement:
BEGIN {FS = "c" }
or by setting the -Fc command-line option, where "c"
and c both stand for the single field separator character you want to use.
One file that you might have viewed, which demonstrates where changing the field
separator could be helpful, is the /etc/passwd file that defines all user
accounts. Rather than having the different fields separated by spaces or tabs, the
password file is structured with lines that look like:
news:?:6:11:USENET News:/usr/spool/news:/bin/ksh
Each field is separated by a colon. You could change each colon to a space (with
sed, for example), but that wouldn't work too well. The fifth field, USENET
News, contains a space already. You should change the field separator. If you
wanted to have a list of the fifth fields in each line, for example, you could use
the simple awk command awk -F: '{print $5}' /etc/passwd.
Likewise, the built-in variable OFS holds the value of the output field
separator. OFS also has a default value of a space. It, too, may be changed
by placing the following line at the start of a program.
BEGIN {OFS = "c" }
If you wanted to automatically translate the /etc/passwd file so that
it listed only the first and fifth fields, separated by a tab, you would use the
awk script:
BEGIN { FS=":" ; OFS="\t" } # "\t" sets OFS to a tab
{ print $1, $5 }
The script contains two blocks: the BEGIN block and the main per-input
line block. Also, most of the work is done automatically.
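Rather than reading the real password file, this sketch feeds awk a single invented passwd-style line (the jdoe account is hypothetical) and shows that setting FS in a BEGIN block and using the -F option give the same result.

```shell
# One made-up line in /etc/passwd format.
line='jdoe:x:1001:100:Jane Doe:/home/jdoe:/bin/sh'

# Separators set inside the program.
with_begin=$(printf '%s\n' "$line" | awk 'BEGIN { FS = ":"; OFS = "\t" } { print $1, $5 }')

# Input separator set with -F; output separator still set in BEGIN.
with_flag=$(printf '%s\n' "$line" | awk -F: 'BEGIN { OFS = "\t" } { print $1, $5 }')

echo "$with_begin"
```

The space inside Jane Doe survives because the colon, not the space, now separates fields.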
Program Format
With a few noted exceptions, awk programs are free format. The interpreter ignores
any blank lines in a program file (also known as an awk script). Add blank lines
to improve the readability of your program. The same is true for tabs and spaces
between operators and the parts of a program. Therefore, these two lines are treated
identically by the awk interpreter:
$4 == 2 {print "Two"}
$4 == 2 { print "Two" }
If more than one action appears on a line, you'll need to separate the actions
with a semicolon, as shown previously in the BEGIN block for the /etc/passwd
file translator. If you stick with one command per line, you won't need to worry
too much about the semicolons. There are a couple of spots, however, in which the
semicolon must always be used: before an else statement or when included
in the syntax of a statement. (See the "Loops" or "The Conditional
Statement" sections in this chapter.)
Putting a semicolon at the end of a statement is useful when you have a C language
background or convert your awk code to a compiled C program.
The other format restriction for awk programs is that at least the opening curly
bracket of the action (of a pattern action statement) must be on the same line as
the accompanying pattern. Thus, the following examples all do the same thing.
The first shows all statements on one line:
$2==0 {print ""; print ""; print "";}
The second example puts the first statement on the same line as the pattern to
match and the remaining statements on the following lines:
$2==0 { print ""
print ""
print ""}
You can spread out the statements even more by moving the first statement to its
own line. Only the initial (opening) curly bracket has to be on the same line as
the pattern:
$2==0 {
print ""
print ""
print ""
}
When the second field of the input file is equal to 0, awk prints three
blank lines to the output file.
NOTE: Notice that print ""
prints a blank line to the output file, whereas the statement print alone
prints the current input line.
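To compare the two forms of print, this sketch echoes each record whose second field is 0 and follows it with one blank line. The input lines are made up.

```shell
# print alone echoes the record; print "" emits an empty line.
out=$(printf 'a 0\nb 1\nc 0\n' | awk '$2 == 0 { print; print "" }')
echo "$out"
```

The shell strips the trailing blank line when capturing the output, so only the blank line between the two matching records remains visible in the variable.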
An awk program file might have commentary within. Anything typed from a #
to the end of the line is considered a comment and is ignored by awk. Comments are
notes explaining what is going on in words, not computerese.
A Note on awk Error Messages
Awk error messages (when they appear) tend to be cryptic. Often, due to the brevity
of the program, a typo is easily found. Not all errors are as obvious; I have scattered
some examples of errors throughout this chapter.
Print Selected Fields
Awk includes three ways to specify printing. The first is implied. A pattern
without an action assumes that the action is to print. The two ways of actively commanding
awk to print are print and printf(). For simplicity, only implied
printing and the print statement are shown here. printf is discussed
in a later section titled "Input/Output" and is used mainly for precise
output. This section demonstrates the first two types of printing through some step-by-step
examples.
Program Components
If I wanted to look for a particular user in the /etc/passwd file, I
could enter an awk command to find a match but omit an action. The following command
line puts a list on-screen.
$ awk '/Ann/' /etc/passwd
amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh
andhs26:0TFnZSVwcua3Y:2488:23:DeAnn O'Neal:/usr/lstudent/andhs26:/bin/csh
alewis:VYfz4EatT4OoA:2623:22:Annie Lewis:/usr/lteach/alewis:/bin/csh
cmcintyr:0FciKEDDMkauU:2630:22:Carol Ann McIntyre:/usr/lteach/cmcintyr:/bin/csh
jflanaga:ShrMnyDwLI/mM:2654:22:JoAnn Flanagan:/usr/lteach/jflanaga:/bin/csh
lschultz:mic35ZiFj9zWk:3060:22:Lee Ann Schultz, :/usr/lteach/lschultz:/bin/csh
akestle:job57Lb5/ofoE:3063:22:Ann Kestle.:/usr/lteach/akestle:/bin/csh
bakehs59:yRYV6BtcW7wFg:3075:23:DeAnna Adlington, Baker :/usr/bakehs59:/bin/csh
ahernan:AZZPQNCkw6ffs:3144:23:Ann Hernandez:/usr/lstudent/ahernan:/bin/csh
$ _
I look on the monitor and see the correct spelling.
NOTE: For the sake of making a point,
suppose I had chosen the pattern /Anne/. A quick glance above shows that
there would be no matches. Entering awk '/Anne/' /etc/passwd would produce
nothing but another system prompt to the monitor. This can be confusing if you expect
output. The same goes the other way; above, I wanted the name Ann, but the names
DeAnn, Annie, JoAnn, and DeAnna matched, too. Sometimes choosing a pattern too long or too
short can cause an unneeded headache.
The grep command can perform the same search performed using awk in the
above example. The real power of awk searching comes from searching specific fields.
The following command matches only records whose fifth field begins with Ann:
$ awk -F: '$5 ~ /^Ann/' /etc/passwd
amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh
alewis:VYfz4EatT4OoA:2623:22:Annie Lewis:/usr/lteach/alewis:/bin/csh
akestle:job57Lb5/ofoE:3063:22:Ann Kestle.:/usr/lteach/akestle:/bin/csh
ahernan:AZZPQNCkw6ffs:3144:23:Ann Hernandez:/usr/lstudent/ahernan:/bin/csh
$ _
I'll discuss more about advanced search strings in the "Patterns" section.
TIP: If a pattern match is not found,
look for a typo in the pattern you are trying to match.
The Input File and Program
Printing specified fields of an ASCII (plain text) file is a straightforward awk
task. Because this program example is so short, only the input is in a file. The
first input file, sales, is a file of car sales by month. The file consists
of each salesperson's name, followed by a monthly sales figure. The end field is
a running total of that person's total sales.
$ cat sales
John Anderson,12,23,7,42
Joe Turner,10,25,15,50
Susan Greco,15,13,18,46
Bob Burmeister,8,21,17,46
The following command line prints the salesperson's name and the total sales for
the first quarter.
$ awk -F, '{print $1,$5}' sales
John Anderson 42
Joe Turner 50
Susan Greco 46
Bob Burmeister 46
A comma (,) between field variables indicates that I want OFS
applied between output fields, as shown in a previous example. Remember, without
the comma, no field separator will be used and the displayed output fields (or output
file) will all run together.
TIP: Putting two field separators in a
row inside a print statement creates a syntax error with the print statement;
however, using the same field twice in a single print statement is valid syntax.
For example:
awk '{print($1,$1)}'
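The difference the comma makes is easy to check on one line of the sales data. This is only a sketch; it assumes the default OFS of a single space.

```shell
record='John Anderson,12,23,7,42'

# With a comma, OFS (a space by default) is placed between the fields.
with_comma=$(printf '%s\n' "$record" | awk -F, '{ print $1, $5 }')

# Without a comma, the two fields are concatenated and run together.
without_comma=$(printf '%s\n' "$record" | awk -F, '{ print $1 $5 }')

echo "$with_comma"
echo "$without_comma"
```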
Patterns
A pattern is the first half of an awk program statement. In awk, there
are six accepted pattern types. You have already seen a couple of them, including
BEGIN, and a specified, slash-delimited pattern, in use. Awk has many string-matching
capabilities arising from patterns and uses regular expressions in patterns. A range
pattern locates a sequence. All patterns except range patterns may be combined in
a compound pattern.
This section explores exactly what is meant by a pattern match. What kind of pattern
you can match depends on exactly how you're using the awk pattern-specification notation.
BEGIN and END
The two special patterns BEGIN and END may be used to indicate
a match, either before the first input record is read or after the last input record
is read, respectively. Some versions of awk require that, if used, BEGIN
must be the first pattern of the program and, if used, END must be the last
pattern of the program. This is good practice to follow even if the version you use
does not require it. Examples in this chapter will follow this practice. Using the
BEGIN pattern for initializing variables is common (although variables can
be passed from the command line to the program too; see the section "Command-Line
Arguments"). The END pattern is used for things which are input-dependent,
such as totals.
If I wanted to know how many lines were in a given program, I would type the following
line:
$ awk 'END {print "Total lines: " NR}' myprogram
I see Total lines: 256 on the monitor and therefore know that the file
myprogram has 256 lines. At any point while awk is processing the file,
the variable NR counts the number of records read so far. NR at
the end of a file has a value equal to the number of lines in the file.
How might you see a BEGIN block in use? Your first thought might be to
initialize variables, but if something is a numeric value, it's automatically initialized
to 0 before its first use. Instead, perhaps you're building a table of data
and want to have some columnar headings. With this in mind, here's a simple awk script
that shows you all the accounts that people named Dave have on your computer:
BEGIN {
FS=":" # remember that the passwd file uses colons
OFS="\t" # set the output separator to a tab
print "Account", "Username"
}
/Dav/ {print $1, $5}
Here's what it looks like in action (I've called this file daves.awk,
although the program matches Dave and David):
$ awk -f daves.awk /etc/passwd
Account Username
andrews Dave Andrews
d3 David Douglas Dunlap
daves Dave Smith
taylor Dave Taylor
Note that you could also easily have a summary of the total number of matched
accounts by adding a variable that's incremented for each match, and then printing
it in the END block. Here's one way to do it:
BEGIN { FS=":" ; OFS="\t" # input colon separated, output tab separated
print "Account", "Username"
}
/Dav/ {print $1, $5 ; matches++ }
END {print "A total of " matches " matches."}
Here, you can see how awk allows you to shorten the length of programs by having
multiple items on a single line, which is particularly useful for initialization.
Also, notice the C increment notation: matches++ is functionally identical
to matches = matches + 1 and matches += 1. Finally, also note that
I did not initialize the variable matches to 0 because it was done
automatically by the awk system.
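The same pattern of accumulating in the main block and reporting in END works for totals. As a sketch, this adds up the fifth field of the sales file shown earlier; the variable name sum is an arbitrary choice, and it starts at 0 without being initialized.

```shell
total=$(printf '%s\n' \
    'John Anderson,12,23,7,42' \
    'Joe Turner,10,25,15,50' \
    'Susan Greco,15,13,18,46' \
    'Bob Burmeister,8,21,17,46' |
  awk -F, '{ sum += $5 } END { print "Total sales: " sum }')
echo "$total"
```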
Expressions
Any expression can be used with any operator in awk. An expression consists
of an operator and its corresponding operands, and it can serve as the pattern of a
pattern action statement. Type conversion--variables being interpreted as numbers at one point,
but strings at another--is automatic; awk has no explicit cast operator. The type of operand needed
is decided by the operator type. If a numeric operator is given a string operand,
it is converted, and vice versa.
TIP: To force a conversion, if the desired
change is string to number, add (+) 0. If you want to explicitly convert
a number to a string, concatenate "" (the null string) to the variable.
Two quick examples are these: num=3; num=num "" creates a new
numeric variable and sets it to the number three; appending a null string to it
translates it to a string (the string with the character 3 within).
Adding 0 to the string created by str="3"; str=str + 0
forces it back to a numeric value.
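A small sketch can show why forcing a conversion matters. Here the string "3" compares differently against 10 depending on whether the comparison is made on strings or on numbers; all the variable names are invented for the example.

```shell
out=$(awk 'BEGIN {
    num = 3
    str = num ""              # concatenating "" yields the string "3"
    n = "3 apples" + 0        # adding 0 keeps the leading number: 3
    s_cmp = (str < "10")      # string comparison: "3" sorts after "10", so 0
    n_cmp = (str + 0 < 10)    # numeric comparison: 3 is less than 10, so 1
    print n, s_cmp, n_cmp
}')
echo "$out"
```

Because a BEGIN block needs no input, this program runs without reading any file.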
Any expression can be a pattern. If the pattern, in this case the expression,
evaluates to a non-zero or non-null value, the pattern matches that input record.
Patterns often involve comparison. Table 4.1 shows the valid awk comparison operators.
Table 4.1. Comparison operators in awk.
Operator | Meaning
== | Equal to
< | Less than
> | Greater than
<= | Less than or equal to
>= | Greater than or equal to
!= | Not equal to
~ | Matched by
!~ | Not matched by
In awk, as in C, the logical equality operator is == rather than =.
The single = assigns values, whereas == compares values. When the
pattern is a comparison, the pattern matches if the comparison is true (non-null
or non-zero). Here's an example. What if you wanted to print only lines in which the
first field had a numeric value of less than 20? Here's how:
$1 < 20 {print $0}
If the expression is arithmetic, it is matched when it evaluates to a non-zero
number. For example, here's a small program that will print the first 10 lines that
have exactly 7 words:
BEGIN {i=0}
NF==7 { print $0 ; i++ }
i==10 {exit}
There's another way that you could use these comparisons too, because awk understands
collation orders (that is, whether words are greater or lesser than other words in
a standard dictionary ordering). Consider the situation wherein you have a phone
directory--a sorted list of names--in a file and you want to print all the names
that would appear in the corporate phone book before a certain person, say D. Hughes.
You could do this quite succinctly by exiting at the first name that sorts at or after
Hughes,D and printing every record before it:
$1 >= "Hughes,D" { exit }
{ print }
When the pattern is a string, a match occurs if the expression is non-null. In
the earlier example with the pattern /Ann/, it was assumed to be a string
because it was enclosed in slashes. In a comparison expression, if both operands
have a numeric value, the comparison is based on the numeric value. Otherwise, the
comparison is made using string ordering, which is why this simple example works.
TIP: You can write more than two comparisons
to a line in awk.
The pattern $2 <= $1 could involve either a numeric comparison or
a string comparison. Whichever it is, it will vary from file to file or even from
record to record within the same file.
TIP: Know your input file well when using
such patterns, particularly since awk will often silently assume a type for the variable
and work with it, without error messages or other warnings.
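To illustrate how the same pattern can switch between numeric and string comparison from record to record, this sketch runs $2 <= $1 over four made-up lines. The first two hold numbers, so they are compared numerically; the last two hold words, so they are compared as strings.

```shell
# "10 9": 9 <= 10 is true numerically (as strings, "9" would sort after "10").
# "9 10": 10 <= 9 is false numerically.
# "abc abd": "abd" <= "abc" is false as strings.
# "abd abc": "abc" <= "abd" is true as strings.
out=$(printf '10 9\n9 10\nabc abd\nabd abc\n' | awk '$2 <= $1')
echo "$out"
```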
String Matching
There are three forms of string matching. The simplest is to surround a string
by slashes (/). No quotation marks are used. Hence /"Ann"/
looks for the string "Ann" with the quotation marks included, not the string Ann, and so
most likely produces no output. The entire input record is printed if the expression within the
slashes is anywhere in the record. The other two matching operators have a more specific
scope. The operator ~ means "is matched by," and the pattern matches
when the input field being tested for a match contains the substring on the right
side.
$2 ~ /mm/
This example matches every input record containing mm somewhere in the
second field. It could also be written as $2 ~ "mm".
The other operator !~ means "is not matched by."
$2 !~ /mm/
This example matches every input record not containing mm anywhere in
the second field.
Armed with that explanation, you can now see that /Ann/ is really just
shorthand for the more complex statement $0 ~ /Ann/.
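The three forms can be compared side by side. In this sketch, run on three invented lines, the bare /mm/ pattern tests the whole record, while ~ and !~ test only the second field.

```shell
# Hypothetical input: note that "common" contains mm, but it is in field 1.
data=$(printf 'saw hammer\nnail gimmick\ncommon pin\n')

matched=$(printf '%s\n' "$data" | awk '$2 ~ /mm/')     # mm in the second field
unmatched=$(printf '%s\n' "$data" | awk '$2 !~ /mm/')  # no mm in the second field
whole=$(printf '%s\n' "$data" | awk '/mm/')            # mm anywhere in the record

echo "$matched"
```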
Regular expressions are common to UNIX, and they come in two main flavors. You
have probably used them subconsciously on the command line as wildcards, where *
matches zero or more characters and ? matches any single character. For
instance, entering the line below results in the command interpreter matching
all files whose names end in abc and the rm command deleting them.
rm *abc
Awk works with regular expressions that are similar to those used with grep,
sed, and other editors but subtly different from the wildcards used with
the command shell. In particular, . matches any single character and * matches
zero or more of the previous character in the pattern. (A pattern of x*y
will match anything that has any number of the letter x, including none, followed by a y.
To force at least one x to appear, you'd need to use the regular expression
xx*y instead.) By default, patterns can match anywhere on the line, so
to tie them to an edge, you need to use ^ to indicate the beginning
of the field or line and $ for the end. If you wanted to match all lines
where the first word ends in abc, for example, you could use $1 ~ /abc$/.
The following line matches all records where the fourth field begins with the letter
a:
$4 ~ /^a.*/
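These rules can be tried out on a handful of made-up words. The sketch below anchors each pattern with ^ and $ so that the whole word has to match, which makes the difference between x*y and xx*y visible.

```shell
words=$(printf 'y\nxy\nxxy\nabc\nabcd\n')

# x*y: zero or more x, then y.
any_x=$(printf '%s\n' "$words" | awk '/^x*y$/')

# xx*y: at least one x, then y.
one_x=$(printf '%s\n' "$words" | awk '/^xx*y$/')

# abc$: the first field must end in abc.
ends_abc=$(printf '%s\n' "$words" | awk '$1 ~ /abc$/')

echo "$any_x"
```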
Range Patterns
The pattern portion of a pattern/action pair can also consist of two patterns
separated by a comma (,); the action is performed for all lines between
the first occurrence of the first pattern and the next occurrence of the second.
At most companies, employees receive different benefits according to their respective
hire dates. It so happens that I have a file listing all employees in my company,
including their hire dates. The first field is the employee's name and the third
field is the year hired, so a range pattern on the third field can list just the
employees hired between 1980 and 1985. Here's how that
data file might look. (Notice that I use : to separate fields so that we
don't have to worry about the spaces in the employee names.)
$ cat emp.data
John Anderson:sales:1980
Joe Turner:marketing:1982
Susan Greco:sales:1985
Ike Turner:pr:1988
Bob Burmeister:accounting:1991
The program could then be invoked:
$ awk -F: '$3 == 1980,$3 == 1985 {print $1, $3}' emp.data
With the output:
John Anderson 1980
Joe Turner 1982
Susan Greco 1985
TIP: The preceding example works because
the input is already in order according to hire year. Range patterns often work best
with presorted input. If this data file were not in order, you could use the
command sort -t: -k3,3n emp.data > new.emp.data to sort it numerically on the
third field. (See Chapter 3, "Text Editing
with vi and EMACS," for more details on using the powerful sort command.)
Range patterns are inclusive; they include both the first item matched and the
end data indicated in the pattern. The range pattern matches all records from the
first occurrence of the first pattern to the first occurrence of the second. This
is a subtle point, but it has a major effect on how range patterns work. First, if
the second pattern is never found, all remaining records match. So given the input
file here:
$ cat sample.data
1
3
5
7
9
11
The following output appears on the monitor, totally disregarding that 9
and 11 are out of range.
$ awk '$1==3, $1==8' sample.data
3
5
7
9
11
The end pattern of a range is not equivalent to a <= operator, although
careful choice of the patterns can avoid the problem, as shown in the employee
hire date example. Using compound patterns is one way to get around this limitation.
Secondly, the range ends at the first record that matches the end pattern. After that,
awk goes back to looking for the start pattern, and a new range opens only if the
start pattern appears again. That's why you have to make sure that the data
is sorted as you expect.
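As a sketch of the compound-pattern workaround, the same sample values can be filtered by value rather than by a stop marker. The temporary file and the variable names (data, bounded) are assumptions made for this illustration only.

```shell
# Recreate the sample values from the text in a temporary file.
data=$(mktemp)
printf '%s\n' 1 3 5 7 9 11 > "$data"

# Test each record on its own value instead of waiting for a stop marker.
bounded=$(awk '$1 >= 3 && $1 <= 8' "$data")
printf '%s\n' "$bounded"

rm -f "$data"
```

Because every record is tested independently, 9 and 11 are left out even though no record equals 8.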
CAUTION: Range patterns cannot be parts
of a larger pattern.
A more useful example of the range pattern comes from awk's capability to handle
multiple input files. I have a function finder program that finds code segments I
know exist and tells me where they are. The code segments for a particular function
X, for example, are bracketed by the phrase "function X"
at the beginning and } /* end of X at the end. It can be expressed as the
awk pattern range:
'/function functionname/,/} \/\* end of functionname/'
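Here is a small, hedged run of that idea against an invented source file. The function name area, the file contents, and the variable names (src, found) are all hypothetical. Note that the / and * of the closing comment are escaped inside the regular expression, because an unescaped / would end the pattern early.

```shell
# An invented source file with one bracketed function.
src=$(mktemp)
cat > "$src" <<'EOF'
int unrelated;
function area(w, h)
{
    return w * h;
} /* end of area */
int also_unrelated;
EOF

# Print everything from the opening phrase through the closing comment.
found=$(awk '/function area/,/} \/\* end of area/' "$src")
printf '%s\n' "$found"

rm -f "$src"
```

Only the four lines of the function are printed; the declarations before and after it fall outside the range.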
Compound Patterns
Patterns can be combined using the logical operators and parentheses as needed.
(See Table 4.2.)
Table 4.2. The logical operators in awk.
Operator    Meaning
!           Not
||          Or (you can also use | in regular expressions)
&&          And
The pattern can be simple or quite complicated: (NF<4) || (NF>4).
This matches all input records not having exactly four fields. As is usual in awk,
there is a wide variety of ways to do the same thing (specify a pattern). Regular
expressions are allowed in string matching, but their use is not required. To form
a pattern that matches strings beginning with a or b or c
or d, there are several pattern options:
/^[a-d].*/
/^a.*/ || /^b.*/ || /^c.*/ || /^d.*/
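The two forms can be compared side by side. This is only a sketch: the word list and the variable names (words, class, chain) are made up, and the trailing .* is dropped because an anchored pattern matches with or without it.

```shell
# Made-up input, one word per record.
words='apple
banana
cherry
date
egg
fig'

# Bracket expression: first character is a, b, c, or d.
class=$(printf '%s\n' "$words" | awk '/^[a-d]/')

# The same test spelled out as four patterns joined with ||.
chain=$(printf '%s\n' "$words" | awk '/^a/ || /^b/ || /^c/ || /^d/')

printf '%s\n' "$class"
[ "$class" = "$chain" ] && echo "both patterns select the same records"
```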
NOTE: The range pattern $1==2,$1==4 and the compound pattern
$1>=2 && $1<=4 do not select the same records. First,
the range pattern depends on the occurrence of the second pattern as a stop marker,
not on the value indicated in the range. Secondly, once a range has ended, later
records are selected only if the first pattern occurs again and opens a new range;
the compound pattern tests every record on its own.
For instance, consider the following simple input file:
$ cat mydata
1 0
3 1
4 1
5 1
7 0
4 2
5 2
1 0
4 3
The first range I try, '$1==3,$1==5', produces
$ awk '$1==3,$1==5' mydata
3 1
4 1
5 1
Compare this to the following pattern and output:
$ awk '$1>=3 && $1<=5' mydata
3 1
4 1
5 1
4 2
5 2
4 3
Range patterns cannot be parts of a combined pattern.
Actions
As the name suggests, the action part tells awk what to do when a pattern is found.
Patterns are optional. An awk program built solely of actions looks like a program
in other iterative programming languages. But looks are deceptive; even without a
pattern, awk reads each input record and applies the pattern-action statements to it
in order before reading the next record.
Actions must be enclosed in curly braces ({}), whether accompanied by
a pattern or alone. An action part can consist of multiple statements, separated by
newlines or semicolons. When the statements have no pattern and are single statements
(no compound loops or conditions), it makes no difference whether they share one pair
of braces or each sits in its own pair. Consider the following three action pieces:
{
name = $1;
print name;
}
and
{name = $1
print name}
and
{name = $1}
{print name}
These three produce identical output. Personally, I use the first because I find
it more readable (and I code my C programs the same way).
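A quick way to convince yourself is to run all three forms on the same record. The input line 'Ann 42' and the variable names (one, two, three) are invented for this sketch.

```shell
# Form 1: one action, statements ended with semicolons.
one=$(echo 'Ann 42' | awk '{
name = $1;
print name;
}')

# Form 2: one action, statements separated by a newline only.
two=$(echo 'Ann 42' | awk '{name = $1
print name}')

# Form 3: two separate pattern-less actions, applied in order.
three=$(echo 'Ann 42' | awk '{name = $1}
{print name}')

printf '%s\n%s\n%s\n' "$one" "$two" "$three"
```

Each form prints Ann once.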
Variables
Variables are an integral part of any programming language: the virtual
boxes within which you can store values, count things, and more. In this section,
I talk about variables in awk. Awk has three types of variables: user-defined variables,
field variables, and predefined variables that are provided by the language automatically.
Awk doesn't have variable declarations. A variable comes to life the first time it
is mentioned.
CAUTION: Because there are no declarations,
be doubly careful to initialize all the variables you use, although you can always
be sure that they automatically start with the value 0 (or the empty string, when
used as a string).
Naming
The rule for naming user-defined variables is that they can be any combination
of letters, digits, and underscores, as long as the name does not start with a digit. It
is helpful to give a variable a name indicative of its purpose in the program. Variables
already defined by awk are written in all uppercase. Because awk is case-sensitive,
ofs is not the same variable as OFS, and wrong capitalization is a common
error. You have already seen field variables--variables beginning
with $, followed by a number, and indicating a specific input field.
A variable is a number, string, or both. There is no type declaration, and type
conversion is automatic if needed. Recall the employee file used earlier. For illustration,
suppose I entered the program awk -F: '{ print $1 * 10}' emp.data; awk obligingly
provides the result:
0
0
0
0
0
Of course, this makes no sense. The point is that awk did exactly what it was
asked without complaint: It multiplied the name of the employee times 10, and when
it tried to translate the name into a number for the mathematical operation it failed,
resulting in a zero. Ten times zero is still zero.
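The conversion rule is easier to see with a few contrasting inputs. In this sketch the three input lines and the variable name conv are invented. It assumes the usual awk behavior of taking the longest numeric prefix of a string and treating a string with no numeric prefix as zero.

```shell
# Three made-up records: pure text, a pure number, and a mix.
conv=$(printf '%s\n' 'John Anderson' '12' '3 apples' | awk '{ print $0 * 10 }')
printf '%s\n' "$conv"
```

The name yields 0, the number yields 120, and the leading 3 in the mixed line is used on its own, yielding 30.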
Awk in a Shell Script
Before examining the next example, review what you know about shell programming
(Chapters 8-13 of Volume I). Remember, every file containing shell commands needs
to be changed to an executable file before you can run it as a shell script. To do
this, enter chmod +x filename from the command line.
Sometimes, awk's automatic type conversion benefits you. Imagine that I'm still
trying to build an office system with awk scripts and this time I want to be able
to maintain a running monthly sales total based on a data file that contains individual
monthly sales. It looks like this:
$ cat monthly.sales
John Anderson,12,23,7
Joe Turner,10,25,15
Susan Greco,15,13,18
Bob Burmeister,8,21,17
These need to be added together to calculate the running totals for each person's
sales. Let a program do it!
$ cat total.awk
BEGIN {FS=","; #Input fields are separated by commas
OFS=",";} #Put a comma in the output
{print $1, " monthly sales summary: " $2+$3+$4 }
That's the awk script, so let's see how it works:
$ awk -f total.awk monthly.sales
John Anderson, monthly sales summary: 42
Joe Turner, monthly sales summary: 50
Susan Greco, monthly sales summary: 46
Bob Burmeister, monthly sales summary: 46
CAUTION: Always run your program once
to be sure it works before you make it part of a complicated shell script.
The shell script used to run the awk script would look like this:
#!/bin/ksh
# always specify your shell on the first line
awk -f total.awk monthly.sales
exit $? # return awk's return code
Your task has been reduced to entering the monthly sales figures in the sales
file and editing the program file total.awk to include the correct number of fields.
(You could instead put in a for loop like for(i=2; i<=NF; i++), so that the
number of fields is calculated automatically; the loop adds each field to a running
sum, which is printed once after the loop ends.)
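One possible shape for such a loop is sketched below, using the same sales figures. It is not the chapter's total.awk: the variable names (sales, totals, sum) and the output wording are assumptions made for illustration.

```shell
# Recreate the monthly sales data from the text in a temporary file.
sales=$(mktemp)
cat > "$sales" <<'EOF'
John Anderson,12,23,7
Joe Turner,10,25,15
Susan Greco,15,13,18
Bob Burmeister,8,21,17
EOF

# Add up fields 2 through NF, however many months each record holds.
totals=$(awk -F, '{
    sum = 0
    for (i = 2; i <= NF; i++)
        sum += $i
    print $1 ": " sum
}' "$sales")
printf '%s\n' "$totals"

rm -f "$sales"
```

Because the loop runs to NF, adding a fourth month to the data file needs no change to the program.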
In this case, not having to wonder whether a digit is part of a string or a number
is helpful. Just keep an eye on the input data, because awk performs whatever actions
you specify, regardless of the actual data type with which you're working.
Built-In Variables
The built-in variables found in awk provide useful data to your program. The ones
available vary among awk versions; for that reason, notes are included for
those variables found in nawk, POSIX awk, and gawk. As before, unless otherwise noted,
the variables of earlier releases can be found in the later implementations. The
built-in variables are summarized in Table 4.3 at the end of this section.
Awk was released first and contains the core set of built-in variables used by
all updates. Nawk expands the set. The POSIX awk specification encompasses all variables
defined in nawk plus one additional variable. Gawk applies the POSIX awk standards
and then adds some built-in variables that are found in gawk alone; the built-in
variables noted when discussing gawk are unique to gawk. This list is a guideline,
not a hard and fast rule. For instance, the built-in variable ENVIRON is
formally introduced in the POSIX awk specifications; it exists in gawk; it is
also in the System V implementation of nawk, but not in SunOS. Check the man pages
on your system to see which variables your awk supports. (See Chapter 5, Volume
I, "General Commands," for more information on how to use man pages.)
In all implementations of awk, built-in variables are written entirely in uppercase.
Built-In Variables for Awk
When awk first became a part of UNIX, the built-in
variables were the bare essentials. As the name indicates, the variable FILENAME
holds the name of the current input file. Recall the function finder code, and add
on the new line:
/function functionname/,/} \/\* end of functionname/ {print $0}
END {print ""; print "Found in the file " FILENAME}
This adds the finishing touch.
The value of the variable FS determines the input field separator. FS
has a space as its default value. The built-in variable NF contains the
number of fields in the current record. (Remember, fields are akin to words, and
records are input lines.) This value can change for each input record.
What happens if within an awk script I have the following statement?
$3 = "Third field"
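It assigns the new string to $3 and rebuilds $0, joining the fields with OFS; the
other fields keep their values. If the record had fewer than three fields, the missing
ones are created empty and NF becomes 3. The total number of records read can be found
in the variable NR.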
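The effect of a field assignment can be observed directly. This sketch uses two invented input records and invented variable names (grown, rebuilt). It relies on the POSIX-specified behavior that assigning to a field rebuilds $0 with OFS and extends NF when the field did not yet exist.

```shell
# A two-field record: assigning $3 creates the field and raises NF to 3.
grown=$(echo 'a b' | awk '{ $3 = "Third field"; print NF; print $0 }')
printf '%s\n' "$grown"

# A four-field record: NF stays 4, and $0 is rejoined with the new OFS.
rebuilt=$(echo 'a b c d' | awk 'BEGIN { OFS = "-" } { $3 = "X"; print NF; print $0 }')
printf '%s\n' "$rebuilt"
```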
The variable OFS holds the value for the output field separator. The default
value of OFS is a space. The value for the output format for numbers resides
in the variable OFMT, which has a default value of %.6g. This is
the format specifier for the print statement, although its syntax comes
from the C printf format string. ORS is the output record separator.
Unless changed, the value of ORS is newline (\n).
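The three output variables can be seen working together in a single print statement. The values chosen here (a colon separator, an exclamation mark before the newline, two decimal places) and the variable name fmt are arbitrary choices for illustration.

```shell
# OFS goes between the comma-separated items, ORS ends the record,
# and OFMT formats the non-integer number.
fmt=$(echo 'a b' | awk 'BEGIN { OFS = ":"; ORS = "!\n"; OFMT = "%.2f" }
{ print $1, $2, 3.14159 }')
printf '%s\n' "$fmt"
```

The line printed is a:b:3.14! with the number rounded according to OFMT.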
Built-In Variables for Nawk
NOTE: When awk was expanded in 1985, part
of the expansion included adding more built-in variables.
CAUTION: Some implementations of UNIX
simply put the new code in the spot for the old code and didn't bother keeping both
awk and nawk. System V and SunOS have both available. Linux has neither awk nor nawk
but uses gawk. The book The Awk Programming Language (see the "Further
Reading" section at the end of this chapter) by the awk authors speaks of awk
throughout the book, but the programming language it describes is called nawk on
many systems.
The built-in variable ARGC holds the value for the number of command-line
arguments. The variable ARGV is an array containing the command-line arguments.
Subscripts for ARGV begin with 0 and continue through ARGC-1.
ARGV[0] is the name of the command, normally awk. Options to awk itself, such as -F and -f, do not appear in ARGV.
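A BEGIN-only program is a convenient way to inspect these two variables, because awk never tries to open the arguments as files. In this sketch the arguments one and two and the variable name args are invented; ARGV[0] is left out of the output because its exact spelling depends on the awk installed.

```shell
# -F: is an option to awk itself, so it is not counted or stored.
args=$(awk -F: 'BEGIN {
    print ARGC
    for (i = 1; i < ARGC; i++)
        print i, ARGV[i]
}' one two)
printf '%s\n' "$args"
```

ARGC is 3 here: the command name plus the two arguments.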
The variable FNR represents the number of the current record within the current
input file. Like NR, this value changes with each new record. FNR
is always <= NR. The built-in variable RLENGTH holds
the value of the length of string matched by the match function. The variable RS
holds the value of the input record separator. The default value of RS is
a newline. The start of the string matched by the match function resides
in RSTART. Between RSTART and RLENGTH, it is possible
to determine what was matched. The variable SUBSEP contains the value of
the subscript separator. It has a default value of " |