Our company uses the enterprise social networking site Yammer. For those who may not be familiar with it, it is a sort of internal Facebook.
Somebody had created a graph of how membership had increased over the months, but that person had since left the company. People were still interested in how membership had grown over the last year, so I decided to look at the problem.
Yammer has a members page and the page shows the date the person joined.
If we look at the source of the page, it looks something like this:
<td class="identity">
<div class="info">
<img alt="Status_on" ... />
<a href="???" class="primary yj-hovercard-link" title="view profile">???</a>
</div>
</td>
<td class="joined_at">
Nov 24, 2010
</td>
So, I saved the source for all the members in a file called members.txt. This file had about 75,000 lines.
The part I was interested in was the last three lines, or more particularly, the actual date joined. I figured that if I had all the dates joined, I could create the required histogram.
The easiest way to do this was to use grep to find all the lines containing joined_at, and then use the context option (-A 1) to show the line following each match. This gave:
$ grep -A 1 joined_at members.txt
<td class="joined_at">
Nov 24, 2010
--
<td class="joined_at">
Nov 24, 2010
--
<td class="joined_at">
Nov 28, 2010
...
To clean up the output, I used grep -v to drop the matched lines and the -- group separators, which gave:
$ grep -A 1 joined_at members.txt | grep -v joined | grep -v -- --
Nov 24, 2010
Nov 24, 2010
Nov 28, 2010
...
I was not interested in the day on which people joined, only the month and year. Also, I wanted the output in the format year month. This was easily accomplished using awk:
awk '{print $3, $1}'
In other words, print the third field followed by the first.
We now have:
... | awk '{print $3, $1}'
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2009 Oct
...
As you will notice, the data is not necessarily in sorted order. To sort numerically, we need the month number rather than its name. Converting 'Nov' to '11' is easily done with sed. I won't type the full command, but it looks like this:
sed 's/Jan/01/;s/Feb/02/;s/Mar/03/;...'
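Written out in full, the substitution list is purely mechanical, one s/// per month:

```shell
# Map three-letter month names to zero-padded month numbers.
# Zero-padding keeps the later numeric sort well behaved.
sed 's/Jan/01/;s/Feb/02/;s/Mar/03/;s/Apr/04/;s/May/05/;s/Jun/06/;s/Jul/07/;s/Aug/08/;s/Sep/09/;s/Oct/10/;s/Nov/11/;s/Dec/12/'
```

This works here because, after the awk step, each line contains only a year and a month name, so there is no risk of a month abbreviation matching anything else.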
So, we now have:
2010 11
2010 11
2010 11
2009 10
...
All that is left is to sort the output numerically (sort -n) and then count the occurrences of each unique line (uniq -c).
Doing this gives us:
9 2009 10
1 2010 10
64 2010 11
112 2010 12
403 2011 01
60 2011 02
55 2011 03
23 2011 04
33 2011 05
36 2011 06
18 2011 07
60 2011 08
31 2011 09
42 2011 10
22 2011 11
21 2011 12
22 2012 01
23 2012 02
10 2012 03
40 2012 04
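Putting the steps together, the whole analysis is a single pipeline (with the sed list written out in full; members.txt is the saved page source from earlier):

```shell
# One-liner version of the whole analysis: pull the join dates out of the
# saved page source, convert to "year month-number", then sort and count.
grep -A 1 joined_at members.txt \
    | grep -v joined | grep -v -- -- \
    | awk '{print $3, $1}' \
    | sed 's/Jan/01/;s/Feb/02/;s/Mar/03/;s/Apr/04/;s/May/05/;s/Jun/06/;s/Jul/07/;s/Aug/08/;s/Sep/09/;s/Oct/10/;s/Nov/11/;s/Dec/12/' \
    | sort -n | uniq -c
```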
(A histogram of the counts above, showing the number of people who joined each month since 2009, originally appeared here.) There is an interesting network effect in the data. Once the site grew past a critical point, there was an explosion of new members, which took the total to just under half of all members of the organisation, and then sign-ups slowed and became almost constant.
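Sticking with the same toolbox, even the histogram itself can be drawn in the terminal. A quick sketch, assuming the uniq -c output has been saved to a hypothetical counts.txt, with one '#' per ten joiners:

```shell
# Crude terminal histogram. Field 1 is the count from `uniq -c`,
# fields 2 and 3 are year and month; print one '#' per ten people.
awk '{bar = ""; for (i = 0; i < $1; i += 10) bar = bar "#";
      printf "%s-%s %s\n", $2, $3, bar}' counts.txt
```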
What I wanted to show in this post, however, is that with a basic knowledge of Unix tools it is possible to do some reasonably advanced analytics on what may initially seem to be quite unstructured data.