Saturday, March 31, 2012

Let's tawk about AWK

Recently I was at a Python conference and I listened to a speaker tell us that he used to use Fortran, C, Lisp, AWK and a few other languages, but now that he uses Python, he no longer needs to use those other languages.

This surprised me. Although I use Python whenever I can and really enjoy using it, I don't always think it is the best or most suitable language to use for every occasion. In particular, it was AWK which made me think "Why would someone stop using AWK and use Python instead".

I have been meaning to write a short AWK tutorial for months, and recently came across a coworker who I regard as a good programmer who had not heard of AWK which inspired me to start writing this tutorial.

So, in case you've never heard of AWK, what is it? It is a small language that has been around since 1977 and was part of Version 7 Unix.

It is very simple. The structure of an AWK program is generally a series of lines where each line of the series has the following format:

expression { statement }
expression { statement }
...

AWK is like grep in that it looks at each line of a file and processes the file on a line by line basis. However, it will be much easier to demonstrate than explain.

Suppose we have a file with the following contents:

...
9021 2000-12-27 0 0
9021 2000-12-28 0 0
9021 2000-12-29 .6 0
9021 2000-12-30 0 0
9021 2000-12-31 0 0
86071 2000-01-01 .4 0
86071 2000-01-02 0 0
86071 2000-01-03 0 0
86071 2000-01-04 3.4 0
...

The first column (affectionately known as $1 in AWK), is the identifier for a weather station. The second column ($2) is obviously a date, the third column is the amount of rain that fell in the previous 24 hours and the 4th column is a quality flag.

Suppose we wanted to print all records, but only for station 9021, we could do it with the following program:

awk '$1 == 9021 {print $0}'  rain_data.txt
This will give us:
9021 2000-01-01 0 0
9021 2000-01-02 0 0
9021 2000-01-03 0 0
9021 2000-01-04 0 0
9021 2000-01-05 0 0
9021 2000-01-06 0 0
...

In the above example, the expression was '$1 == 9021' and the statement was '{print $0}'. $0 is AWK shorthand for "the whole line".

If we have an expression and we don't have a statement, then the default action is to print the whole line, so we can shorten the above to:

awk '$1 == 9021'  rain_data.txt

Also, we should note that the default field separator is whitespace (it can be one or more spaces or tabs). Now, suppose we wanted to display records for days that actually had some rainfall, we could write:

awk '$3 > 0.0' rain_data.txt
This will give:
9021 2000-01-12 .8 0
9021 2000-01-15 13 0
9021 2000-01-16 5 0
9021 2000-01-17 4.6 0
...
Now suppose we want days where the rainfall was above 10mm and the station was 86071, we could do it as follows:
awk '$3 > 10.0 && $1 == 86071' rain_data.txt
9021 2000-01-12 .8 0
9021 2000-01-15 13 0
9021 2000-01-16 5 0
9021 2000-01-17 4.6 0
...

We have looked at the expressions, so now lets look at the statements. Often we want to print a subset of the fields. Lets print out only the first three fields (in other words omit the quality field). Since we are not printing the quality field, we will select only records where the quality is good (equal to zero).

awk '$3 > 10.0 && $4 == 0 {print $1, $2, $3}' rain_data.txt
This gives:
9021 2000-01-15 13
9021 2000-01-22 60.8
9021 2000-01-23 14.4
9021 2000-03-22 24
...
Now, we will print the rainfall in inches (mm / 25.4)
awk '$3 > 10.0 && $4 == 0 {print $1, $2, $3/25.4}' rain_data.txt
Which gives
9021 2000-01-15 0.511811
9021 2000-01-22 2.3937
9021 2000-01-23 0.566929
9021 2000-03-22 0.944882
...

There are two special expressions in AWK, 'BEGIN' and 'END'. BEGIN matches before the first line of a file has been read and END matches after the last line of text has been read. These are often used for initialisation or printing totals.

As soon as we start doing anything more complex than a single line, it makes sense to create a small script, for example prog.awk, which we then run as follows:

awk -f prog.awk datafile

In the first script, we want to find the maximum rainfall of all the values in our file. We create the following script (maxrain.awk)

    { if ($3 > max) {max = $3} }
END { print max }

The first line does not have an expression, so it matches all records. The if statement says that if the value in the third column (rainfall), is greater than the current value in max, replace it. As mentioned above, 'END' will match after all lines have been processed and will print out the value of max.

We will look at one final aspect of AWK which is associative arrays. An associative array is an array where the index does not have to be an integer (although in this case it will be). We also don't have to declare the size of the array as we do in many other languages.

In this example, we will find the maximum for each of the stations. We do this by creating an array of maximum values where the station number is the index of the array. So, for example, we have station number 86071 and station number 9021. maxarray[86071] will contain the maximum value that we have found so far for that station and maxarray[9021] will contain the maximum value for station 9021.

    { if ($3 > maxarray[$1]) {maxarray[$1] = $3} }
END { for (station in maxarray) print station, maxarray[station] }
Running this we get the following output:
9021 60.8
86071 38.4
...

AWK is a small language and a very powerful language. It is not a general purpose language like Python with a rich library. But I sometimes use it many times a day, because those things that it does well, it does very well. It is also very simple and concise. I rarely need to look up any of the functions of AWK. It is concise enough that many times it can just be typed in from the command line. Any software engineer should have many tools in their toolbelt and choose the best tool for the job. It is surprising how often a little language like AWK will turn out to be that tool.

Friday, March 30, 2012

Quicksort in Python

I have been brushing up on my algorithms. One of the algorithms that I looked at was Quicksort. It is a sorting algorithm developed by Tony Hoare in 1960. If you are using the built in sort function in C, you are using quicksort.

Quicksort is one of those algorithms you study in university and then often forget. However, as I mention, I was brushing up on my algorithms and so decided to revisit quicksort. The way I usually understand algorithms is to first look at the algorithm and then look at an implementation.

Here is a quick description of the algorithm:

Quicksort:
  1. Take an array to be sorted.
  2. If the array has fewer than two values, we will consider it sorted (return).
  3. Choose a value which we will call the pivot value. This can be the first element of the array.
  4. Rearrange the array so that all values less than the pivot value are placed before the pivot value and all other values are placed after the pivot value.
  5. Run Quicksort over each of the two subarrays created in the previous step.
Here is an implemenation in C (from http://rosettacode.org/wiki/Quicksort):

void quick_sort (int *a, int n) {
    if (n < 2)
        return;
    int p = a[n / 2];
    int *l = a;
    int *r = a + n - 1;
    while (l <= r) {
        while (*l < p)
            l++;
        while (*r > p)
            r--;
        if (l <= r) {
            int t = *l;
            *l++ = *r;
            *r-- = t;
        }
    }
    quick_sort(a, r - a + 1);
    quick_sort(l, a + n - l);
}
 
int main () {
    int a[] = {4, 65, 2, -31, 0, 99, 2, 83, 782, 1};
    int n = sizeof a / sizeof a[0];
    quick_sort(a, n);
    return 0;
}

While we can see what the above code is doing, and we can map the variable names to the description from the algorithm, it is not immediately clear how the algorithm works.

In fact, if we replaced the name quick_sort by q, and asked someone what the code was doing, it would be reasonably hard for most people to guess that it was a quicksort.

The next thing I did was to look for a Python implementation and found this one (http://en.literateprograms.org/Quicksort_(Python)?printable=yes):

def qsort1(list):
    """
    Quicksort using list comprehensions
    """
    if list == []: 
        return []
    else:
        pivot = list[0]
        lesser = qsort1([x for x in list[1:] if x < pivot])
        greater = qsort1([x for x in list[1:] if x >= pivot])
        return lesser + [pivot] + greater

Immediately it is obvious what the algorithm is doing. Looking at the algorithm, though, we notice that the first line of the function:

if list == []

Looks for an empty list. This is because if we get an empty list, we can assume it is sorted. However, if we get a list with only one element, we can also assume it is sorted. I changed the lines to:

if len(list) <= 1: 
    return list

I measured the speed of the two algorithms and the second gave me a 10% improvement in speed. Would I have noticed this so easily in the C code? I am not sure. However, having noticed this in the Python code, it seemed fairly obvious that it could be improved and I also felt confident that my improvement was going to work.

So, even though Python may not be as efficient as C or some compiled languages, the clarity of the code certainly allows us to make improvements and optimisations that may otherwise have been very difficult.