Friday, April 27, 2012

Cracking HTML with Unix

Our company uses the enterprise social networking site Yammer. For those who may not be familiar with it, it is a sort of internal Facebook.

Somebody had created a graph of how membership had increased over the months, but the person had left the company. People were interested to see how the membership had increased in the last year. I decided to look at the problem.

Yammer has a members page, and that page shows the date each person joined.

If we look at the source of the page, it looks something like this:

<td class="identity">
    <div class="info">
    <img alt="Status_on" ... />
    <a href="???" class="primary yj-hovercard-link" title="view profile">???</a>
    </div>
</td>

<td class="joined_at">
    Nov 24, 2010
</td>

So, I saved the source for all the members in a file called members.txt. This file had about 75,000 lines.

The part I was interested in was the last three lines, or more particularly, the actual date joined. I figured that if I had all the dates joined, I could create the required histogram.

The easiest way to do this was to use grep to find all the lines containing joined_at and then use the context option (-A 1) to show the line following each match. This gave:


$ grep -A 1 joined_at members.txt
<td class="joined_at">
    Nov 24, 2010
--
<td class="joined_at">
    Nov 24, 2010
--
<td class="joined_at">
    Nov 28, 2010
...

To clean up the output, I used grep -v twice, once to drop the joined_at lines and once to drop the -- group separators, which gave:

$ grep -A 1 joined_at members.txt | grep -v joined | grep -v -- -- 
    Nov 24, 2010
    Nov 24, 2010
    Nov 28, 2010
...
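
The same extraction can also be done in a single pass. What follows is only a minimal sketch, not the command used above: the function name joined_dates is made up for illustration, and it assumes, as in the sample markup, that the date always sits on the line directly after the joined_at cell.

```shell
# Hypothetical helper: print the line that follows each "joined_at" line,
# with leading whitespace stripped.
joined_dates() {
    sed -n '/joined_at/{n;s/^[[:space:]]*//;p;}' "$1"
}

# Small sample in the same shape as the members page source.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<td class="joined_at">
    Nov 24, 2010
</td>
<td class="joined_at">
    Nov 28, 2010
</td>
EOF

joined_dates "$sample"
```

Because sed moves to the next line itself, there are no joined_at lines or -- separators to filter out afterwards.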

I was not interested in the day on which people joined, only the month and year. Also, I wanted it in the format year month. This was easily accomplished using AWK:

awk '{print $3, $1}' 

In other words, print the third and first whitespace-separated fields of each line (the year and the month), skipping the day in between.

We now have:

... | awk '{print $3, $1}'
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2009 Oct
...

As you will notice, the data is not necessarily in sorted order. In order to sort numerically, we need the month number, not the month name. Converting 'Nov' to '11' is done easily using sed. I won't type the full command, but it looks like this:

sed 's/Jan/01/;s/Feb/02/;s/Mar/03/;...'

So, we now have:

2010 11
2010 11
2010 11
2009 10
...
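
As an aside, the name-to-number conversion can also be written as a lookup table rather than a chain of substitutions. This is a sketch of an alternative, not the sed command used above; the function name month_to_number is hypothetical, and it assumes three-letter English month abbreviations in the second field.

```shell
# Hypothetical helper: turn "2010 Nov" into "2010 11" using a lookup table.
month_to_number() {
    awk 'BEGIN {
        # Build num["Jan"] = 1 ... num["Dec"] = 12 from a single string.
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        for (i = 1; i <= n; i++) num[names[i]] = i
    }
    # Known month name: print the year and a zero-padded month number.
    ($2 in num) { printf "%s %02d\n", $1, num[$2]; next }
    # Anything else is passed through unchanged.
    { print }'
}

printf '2010 Nov\n2009 Oct\n' | month_to_number
```

The zero padding (01, 02, ...) keeps the months the same width, so they line up and sort cleanly in the next step.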

All that is left to do is to sort the output numerically (sort -n) and then count the number of unique occurrences of each line (uniq -c).

Doing this gives us:

   9 2009 10
   1 2010 10
  64 2010 11
 112 2010 12
 403 2011 01
  60 2011 02
  55 2011 03
  23 2011 04
  33 2011 05
  36 2011 06
  18 2011 07
  60 2011 08
  31 2011 09
  42 2011 10
  22 2011 11
  21 2011 12
  22 2012 01
  23 2012 02
  10 2012 03
  40 2012 04
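
The order of those last two commands matters: uniq only collapses identical lines that are next to each other, so the sort has to come first. A small sketch of the counting step on three sample lines (the function name count_months is made up for illustration):

```shell
# Hypothetical helper: sort "year month" lines, then count each distinct line.
count_months() {
    sort -n | uniq -c
}

# Two joins in 2010-11 and one in 2009-10, deliberately out of order.
printf '2010 11\n2009 10\n2010 11\n' | count_months
```

Without the sort, the two 2010 11 lines would not be adjacent and uniq -c would report them separately.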

A histogram showing the number of people that joined each month since 2009.

There is an interesting network effect in the above data. As soon as the site grew past a critical point, there was an explosion of new members (which took the number of members to just under half of all members of the organisation), and then the rate of sign-ups slowed and became almost constant.
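
For a quick look without a charting tool, counts in this format can be turned into a rough text histogram in the terminal. This is a minimal sketch under my own assumptions: the function name text_histogram and the scale of one # per 10 members (rounded up) are arbitrary choices for illustration.

```shell
# Hypothetical helper: read "count year month" lines and draw one '#'
# per 10 members, rounded up so that small months still show a mark.
text_histogram() {
    awk '{
        bar = ""
        n = int(($1 + 9) / 10)
        for (i = 0; i < n; i++) bar = bar "#"
        printf "%s-%s %4d %s\n", $2, $3, $1, bar
    }'
}

# Three of the monthly counts from the table above.
printf '   9 2009 10\n  64 2010 11\n 403 2011 01\n' | text_histogram
```

In practice the output of uniq -c could be piped straight into the function.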

However, what I wanted to show in this post was that, with a basic knowledge of Unix tools, it is possible to do some reasonably advanced analytics on what may initially seem to be quite unstructured data.
