Friday, April 27, 2012

Cracking HTML with Unix

Our company uses the enterprise social networking site Yammer. For those who may not be familiar with it, it is a sort of internal Facebook.

Somebody had created a graph of how membership had increased over the months, but that person had since left the company. People were interested to see how membership had grown in the last year, so I decided to look at the problem.

Yammer has a members page, which shows the date on which each person joined.

If we look at the source of the page, it looks something like this:

<td class="identity">
    <div class="info">
    <img alt="Status_on" ... />
    <a href="???" class="primary yj-hovercard-link" title="view profile">???</a>
    </div>
</td>

<td class="joined_at">
    Nov 24, 2010
</td>

So, I saved the source for all the members in a file called members.txt. This file had about 75,000 lines.

The part I was interested in was the last three lines, or more particularly, the actual date joined. I figured that if I had all the dates joined, I could create the required histogram.

The easiest way to do this was to use grep to find all the lines containing joined_at and then use the context option (-A 1) to show the line following. This gave:


$ grep -A 1 joined_at members.txt
<td class="joined_at">
    Nov 24, 2010
--
<td class="joined_at">
    Nov 24, 2010
--
<td class="joined_at">
    Nov 28, 2010
...

To clean up the output, I used two grep -v filters, which gave:

$ grep -A 1 joined_at members.txt | grep -v joined | grep -v -- -- 
    Nov 24, 2010
    Nov 24, 2010
    Nov 28, 2010
...

I was not interested in the day of the month on which people joined, only the month and year. Also, I wanted it in the format year month. This was easily accomplished using AWK:

awk '{print $3, $1}' 

In other words, print the third and first fields of each line.

We now have:

... | awk '{print $3, $1}'
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2009 Oct
...

As you will notice, the data is not necessarily in sorted order. To sort numerically, we need the month's number rather than its name. Converting 'Nov' to '11' is easily done using sed. I won't type the full command, but it looks like this:

sed 's/Jan/01/;s/Feb/02/;s/Mar/03/;...'

So, we now have:

2010 11
2010 11
2010 11
2009 10
...

All that is left to do is to sort the output numerically (sort -n) and then count the number of unique occurrences of each line (uniq -c).

Doing this gives us:

   9 2009 10
   1 2010 10
  64 2010 11
 112 2010 12
 403 2011 01
  60 2011 02
  55 2011 03
  23 2011 04
  33 2011 05
  36 2011 06
  18 2011 07
  60 2011 08
  31 2011 09
  42 2011 10
  22 2011 11
  21 2011 12
  22 2012 01
  23 2012 02
  10 2012 03
  40 2012 04
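Putting it all together, the whole analysis can be sketched as a single pipeline. The here-document below stands in for the real members.txt (the saved page source), and the sed command is written out in full:

```shell
# Sketch of the full pipeline. The here-document stands in for the real
# members.txt, which would contain roughly 75,000 lines of saved page source.
cat > members.txt <<'EOF'
<td class="joined_at">
    Oct 15, 2009
</td>
<td class="joined_at">
    Nov 24, 2010
</td>
<td class="joined_at">
    Nov 28, 2010
</td>
EOF

# grep the date lines, reorder to "year month", map month names to
# numbers, then sort and count the occurrences of each month.
grep -A 1 joined_at members.txt \
    | grep -v joined | grep -v -- -- \
    | awk '{print $3, $1}' \
    | sed 's/Jan/01/;s/Feb/02/;s/Mar/03/;s/Apr/04/;s/May/05/;s/Jun/06/;s/Jul/07/;s/Aug/08/;s/Sep/09/;s/Oct/10/;s/Nov/11/;s/Dec/12/' \
    | sort -n | uniq -c
```

Run against the real saved page instead of the stand-in file, this produces the table above.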

This is effectively a histogram of the number of people who joined in each month since October 2009. There is an interesting network effect in the data: as soon as the site grew past a critical point, there was an explosion of new members (which took membership to just under half of the organisation), after which sign-ups slowed and became almost constant.

However, what I wanted to show in this post is that, with a basic knowledge of Unix tools, it is possible to do some reasonably advanced analytics on what may initially seem to be quite unstructured data.

Saturday, March 31, 2012

Let's tawk about AWK

Recently I was at a Python conference and I listened to a speaker tell us that he used to use Fortran, C, Lisp, AWK and a few other languages, but now that he uses Python, he no longer needs to use those other languages.

This surprised me. Although I use Python whenever I can and really enjoy using it, I don't always think it is the best or most suitable language for every occasion. In particular, it was AWK which made me think: "Why would someone stop using AWK and use Python instead?"

I have been meaning to write a short AWK tutorial for months, and recently discovered that a coworker I regard as a good programmer had not heard of AWK, which inspired me to start writing this tutorial.

So, in case you've never heard of AWK, what is it? It is a small language that has been around since 1977 and was part of Version 7 Unix.

It is very simple. The structure of an AWK program is generally a series of lines where each line of the series has the following format:

expression { statement }
expression { statement }
...

AWK is like grep in that it looks at each line of a file and processes the file on a line by line basis. However, it will be much easier to demonstrate than explain.

Suppose we have a file with the following contents:

...
9021 2000-12-27 0 0
9021 2000-12-28 0 0
9021 2000-12-29 .6 0
9021 2000-12-30 0 0
9021 2000-12-31 0 0
86071 2000-01-01 .4 0
86071 2000-01-02 0 0
86071 2000-01-03 0 0
86071 2000-01-04 3.4 0
...

The first column (affectionately known as $1 in AWK) is the identifier for a weather station. The second column ($2) is obviously a date, the third column is the amount of rain (in mm) that fell in the previous 24 hours, and the fourth column is a quality flag.

Suppose we wanted to print all records for station 9021. We could do it with the following program:

awk '$1 == 9021 {print $0}'  rain_data.txt
This will give us:
9021 2000-01-01 0 0
9021 2000-01-02 0 0
9021 2000-01-03 0 0
9021 2000-01-04 0 0
9021 2000-01-05 0 0
9021 2000-01-06 0 0
...

In the above example, the expression was '$1 == 9021' and the statement was '{print $0}'. $0 is AWK shorthand for "the whole line".

If we have an expression and we don't have a statement, then the default action is to print the whole line, so we can shorten the above to:

awk '$1 == 9021'  rain_data.txt

Also, we should note that the default field separator is whitespace (one or more spaces or tabs). Now, suppose we wanted to display records for days that actually had some rainfall. We could write:

awk '$3 > 0.0' rain_data.txt
This will give:
9021 2000-01-12 .8 0
9021 2000-01-15 13 0
9021 2000-01-16 5 0
9021 2000-01-17 4.6 0
...
Now suppose we want days where the rainfall was above 10mm and the station was 86071. We can combine the two conditions with &&:
awk '$3 > 10.0 && $1 == 86071' rain_data.txt
This gives the same kind of output as before, but restricted to station 86071.

We have looked at expressions, so now let's look at the statements. Often we want to print a subset of the fields. Let's print out only the first three fields (in other words, omit the quality field). Since we are not printing the quality field, we will select only records where the quality is good (equal to zero).

awk '$3 > 10.0 && $4 == 0 {print $1, $2, $3}' rain_data.txt
This gives:
9021 2000-01-15 13
9021 2000-01-22 60.8
9021 2000-01-23 14.4
9021 2000-03-22 24
...
Now, we will print the rainfall in inches (mm / 25.4):
awk '$3 > 10.0 && $4 == 0 {print $1, $2, $3/25.4}' rain_data.txt
Which gives
9021 2000-01-15 0.511811
9021 2000-01-22 2.3937
9021 2000-01-23 0.566929
9021 2000-03-22 0.944882
...

There are two special expressions in AWK, 'BEGIN' and 'END'. BEGIN matches before the first line of a file has been read and END matches after the last line of text has been read. These are often used for initialisation or printing totals.
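As a quick sketch of BEGIN and END in action, the following prints a heading before any input is read, accumulates a running total, and reports it after the last line. The here-document stands in for a rain_data.txt in the format above; the data lines are just examples:

```shell
# BEGIN runs before any input is read; END runs after all of it.
cat > rain_data.txt <<'EOF'
9021 2000-12-29 .6 0
9021 2000-12-30 0 0
86071 2000-01-01 .4 0
86071 2000-01-04 3.4 0
EOF

awk 'BEGIN { print "rainfall summary" }
           { total += $3; days++ }
     END   { print "total:", total, "over", days, "days" }' rain_data.txt
# prints:
# rainfall summary
# total: 4.4 over 4 days
```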

As soon as we start doing anything more complex than a single line, it makes sense to create a small script, for example prog.awk, which we then run as follows:

awk -f prog.awk datafile

For our first script, we want to find the maximum rainfall of all the values in our file. We create the following script (maxrain.awk):

    { if ($3 > max) {max = $3} }
END { print max }

The first line does not have an expression, so it matches all records. The if statement says that if the value in the third column (rainfall) is greater than the current value of max, replace it. As mentioned above, 'END' matches after all lines have been processed, and prints out the value of max.
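Running it looks like this; the script is the one above, and the here-document is a small stand-in for the real data file:

```shell
# Create the script from the post and a small stand-in data file, then run it.
cat > maxrain.awk <<'EOF'
    { if ($3 > max) {max = $3} }
END { print max }
EOF

cat > rain_data.txt <<'EOF'
9021 2000-01-15 13 0
9021 2000-01-22 60.8 0
86071 2000-01-04 3.4 0
EOF

awk -f maxrain.awk rain_data.txt
# prints 60.8
```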

We will look at one final aspect of AWK which is associative arrays. An associative array is an array where the index does not have to be an integer (although in this case it will be). We also don't have to declare the size of the array as we do in many other languages.

In this example, we will find the maximum for each of the stations. We do this by creating an array of maximum values where the station number is the index of the array. So, for example, we have station number 86071 and station number 9021. maxarray[86071] will contain the maximum value that we have found so far for that station and maxarray[9021] will contain the maximum value for station 9021.

    { if ($3 > maxarray[$1]) {maxarray[$1] = $3} }
END { for (station in maxarray) print station, maxarray[station] }
Running this we get the following output:
9021 60.8
86071 38.4
...

AWK is a small language and a very powerful one. It is not a general-purpose language like Python with a rich library, but I sometimes use it many times a day, because the things it does well, it does very well. It is also very simple and concise: I rarely need to look up any of AWK's functions, and many programs are short enough to be typed straight in at the command line. Any software engineer should have many tools in their toolbelt and choose the best tool for the job. It is surprising how often a little language like AWK turns out to be that tool.

Friday, March 30, 2012

Quicksort in Python

I have been brushing up on my algorithms. One of the algorithms that I looked at was Quicksort, a sorting algorithm developed by Tony Hoare in 1960. The C standard library's qsort function even takes its name from it.

Quicksort is one of those algorithms you study in university and then often forget. However, as I mentioned, I was brushing up on my algorithms and so decided to revisit quicksort. The way I usually understand algorithms is to first look at the algorithm and then look at an implementation.

Here is a quick description of the algorithm:

Quicksort:
  1. Take an array to be sorted.
  2. If the array has fewer than two values, we will consider it sorted (return).
  3. Choose a value which we will call the pivot value. This can be the first element of the array.
  4. Rearrange the array so that all values less than the pivot value are placed before the pivot value and all other values are placed after the pivot value.
  5. Run Quicksort over each of the two subarrays created in the previous step.
Here is an implementation in C (from http://rosettacode.org/wiki/Quicksort):

void quick_sort (int *a, int n) {
    if (n < 2)
        return;
    int p = a[n / 2];
    int *l = a;
    int *r = a + n - 1;
    while (l <= r) {
        while (*l < p)
            l++;
        while (*r > p)
            r--;
        if (l <= r) {
            int t = *l;
            *l++ = *r;
            *r-- = t;
        }
    }
    quick_sort(a, r - a + 1);
    quick_sort(l, a + n - l);
}
 
int main () {
    int a[] = {4, 65, 2, -31, 0, 99, 2, 83, 782, 1};
    int n = sizeof a / sizeof a[0];
    quick_sort(a, n);
    return 0;
}

While we can see what the above code is doing, and we can map the variable names to the description from the algorithm, it is not immediately clear how the algorithm works.

In fact, if we replaced the name quick_sort by q, and asked someone what the code was doing, it would be reasonably hard for most people to guess that it was a quicksort.

The next thing I did was to look for a Python implementation and found this one (http://en.literateprograms.org/Quicksort_(Python)?printable=yes):

def qsort1(list):
    """
    Quicksort using list comprehensions
    """
    if list == []: 
        return []
    else:
        pivot = list[0]
        lesser = qsort1([x for x in list[1:] if x < pivot])
        greater = qsort1([x for x in list[1:] if x >= pivot])
        return lesser + [pivot] + greater

Immediately it is obvious what the algorithm is doing. Comparing the code with the description above, though, we notice that the first test in the function:

if list == []:

only checks for an empty list, which we can safely consider sorted. However, a list with a single element can also be considered sorted (step 2 of the description says "fewer than two values"). I changed the lines to:

if len(list) <= 1: 
    return list

I measured the speed of the two algorithms and the second gave me a 10% improvement in speed. Would I have noticed this so easily in the C code? I am not sure. However, having noticed this in the Python code, it seemed fairly obvious that it could be improved and I also felt confident that my improvement was going to work.
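A rough version of that comparison can be sketched with timeit. The list size and repeat count here are arbitrary, and the exact speedup will vary with input and machine (the 10% figure above is from my own measurement):

```python
# Sketch: timing the original empty-list base case against the len() <= 1 one.
import random
import timeit

def qsort1(lst):
    """Quicksort with the original empty-list base case."""
    if lst == []:
        return []
    pivot = lst[0]
    lesser = qsort1([x for x in lst[1:] if x < pivot])
    greater = qsort1([x for x in lst[1:] if x >= pivot])
    return lesser + [pivot] + greater

def qsort2(lst):
    """Quicksort that also treats a single-element list as sorted."""
    if len(lst) <= 1:
        return lst
    pivot = lst[0]
    lesser = qsort2([x for x in lst[1:] if x < pivot])
    greater = qsort2([x for x in lst[1:] if x >= pivot])
    return lesser + [pivot] + greater

# Random data keeps the recursion shallow; sorted input would be the
# worst case for this pivot choice and could hit the recursion limit.
data = [random.randint(0, 10000) for _ in range(1000)]
print(timeit.timeit(lambda: qsort1(data), number=20))
print(timeit.timeit(lambda: qsort2(data), number=20))
```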

So, even though Python may not be as efficient as C or some compiled languages, the clarity of the code certainly allows us to make improvements and optimisations that may otherwise have been very difficult.