Saturday, March 31, 2012

Let's tawk about AWK

Recently I was at a Python conference and I listened to a speaker tell us that he used to use Fortran, C, Lisp, AWK and a few other languages, but now that he uses Python, he no longer needs to use those other languages.

This surprised me. Although I use Python whenever I can and really enjoy using it, I don't always think it is the best or most suitable language to use for every occasion. In particular, it was AWK which made me think "Why would someone stop using AWK and use Python instead".

I have been meaning to write a short AWK tutorial for months, and recently came across a coworker who I regard as a good programmer who had not heard of AWK which inspired me to start writing this tutorial.

So, in case you've never heard of AWK, what is it? It is a small language that has been around since 1977 and was part of Version 7 Unix.

It is very simple. The structure of an AWK program is generally a series of lines where each line of the series has the following format:

expression { statement }
expression { statement }

AWK is like grep in that it looks at each line of a file and processes the file on a line by line basis. However, it will be much easier to demonstrate than explain.

Suppose we have a file with the following contents:

9021 2000-12-27 0 0
9021 2000-12-28 0 0
9021 2000-12-29 .6 0
9021 2000-12-30 0 0
9021 2000-12-31 0 0
86071 2000-01-01 .4 0
86071 2000-01-02 0 0
86071 2000-01-03 0 0
86071 2000-01-04 3.4 0

The first column (affectionately known as $1 in AWK), is the identifier for a weather station. The second column ($2) is obviously a date, the third column is the amount of rain that fell in the previous 24 hours and the 4th column is a quality flag.

Suppose we wanted to print all records, but only for station 9021, we could do it with the following program:

awk '$1 == 9021 {print $0}'  rain_data.txt
This will give us:
9021 2000-01-01 0 0
9021 2000-01-02 0 0
9021 2000-01-03 0 0
9021 2000-01-04 0 0
9021 2000-01-05 0 0
9021 2000-01-06 0 0

In the above example, the expression was '$1 == 9021' and the statement was '{print $0}'. $0 is AWK shorthand for "the whole line".

If we have an expression and we don't have a statement, then the default action is to print the whole line, so we can shorten the above to:

awk '$1 == 9021'  rain_data.txt

Also, we should note that the default field separator is whitespace (it can be one or more spaces or tabs). Now, suppose we wanted to display records for days that actually had some rainfall, we could write:

awk '$3 > 0.0' rain_data.txt
This will give:
9021 2000-01-12 .8 0
9021 2000-01-15 13 0
9021 2000-01-16 5 0
9021 2000-01-17 4.6 0
Now suppose we want days where the rainfall was above 10mm and the station was 86071, we could do it as follows:
awk '$3 > 10.0 && $1 == 86071' rain_data.txt
9021 2000-01-12 .8 0
9021 2000-01-15 13 0
9021 2000-01-16 5 0
9021 2000-01-17 4.6 0

We have looked at the expressions, so now lets look at the statements. Often we want to print a subset of the fields. Lets print out only the first three fields (in other words omit the quality field). Since we are not printing the quality field, we will select only records where the quality is good (equal to zero).

awk '$3 > 10.0 && $4 == 0 {print $1, $2, $3}' rain_data.txt
This gives:
9021 2000-01-15 13
9021 2000-01-22 60.8
9021 2000-01-23 14.4
9021 2000-03-22 24
Now, we will print the rainfall in inches (mm / 25.4)
awk '$3 > 10.0 && $4 == 0 {print $1, $2, $3/25.4}' rain_data.txt
Which gives
9021 2000-01-15 0.511811
9021 2000-01-22 2.3937
9021 2000-01-23 0.566929
9021 2000-03-22 0.944882

There are two special expressions in AWK, 'BEGIN' and 'END'. BEGIN matches before the first line of a file has been read and END matches after the last line of text has been read. These are often used for initialisation or printing totals.

As soon as we start doing anything more complex than a single line, it makes sense to create a small script, for example prog.awk, which we then run as follows:

awk -f prog.awk datafile

In the first script, we want to find the maximum rainfall of all the values in our file. We create the following script (maxrain.awk)

    { if ($3 > max) {max = $3} }
END { print max }

The first line does not have an expression, so it matches all records. The if statement says that if the value in the third column (rainfall), is greater than the current value in max, replace it. As mentioned above, 'END' will match after all lines have been processed and will print out the value of max.

We will look at one final aspect of AWK which is associative arrays. An associative array is an array where the index does not have to be an integer (although in this case it will be). We also don't have to declare the size of the array as we do in many other languages.

In this example, we will find the maximum for each of the stations. We do this by creating an array of maximum values where the station number is the index of the array. So, for example, we have station number 86071 and station number 9021. maxarray[86071] will contain the maximum value that we have found so far for that station and maxarray[9021] will contain the maximum value for station 9021.

    { if ($3 > maxarray[$1]) {maxarray[$1] = $3} }
END { for (station in maxarray) print station, maxarray[station] }
Running this we get the following output:
9021 60.8
86071 38.4

AWK is a small language and a very powerful language. It is not a general purpose language like Python with a rich library. But I sometimes use it many times a day, because those things that it does well, it does very well. It is also very simple and concise. I rarely need to look up any of the functions of AWK. It is concise enough that many times it can just be typed in from the command line. Any software engineer should have many tools in their toolbelt and choose the best tool for the job. It is surprising how often a little language like AWK will turn out to be that tool.

No comments:

Post a Comment