Friday, April 27, 2012

Cracking HTML with Unix

Our company uses the enterprise social networking site Yammer. For those who may not be familiar with it, it is a sort of internal Facebook.

Somebody had created a graph of how membership had increased over the months, but that person had since left the company, and people were interested to see how membership had grown over the last year. I decided to take a look at the problem.

Yammer has a members page, and the page shows the date each person joined.

If we look at the source of the page, it looks something like this:

<td class="identity">
    <div class="info">
    <img alt="Status_on" ... />
    <a href="???" class="primary yj-hovercard-link" title="view profile">???&/a>
    </div>
</td>

<td class="joined_at">
    Nov 24, 2010
</td>

So, I saved the source for all the members in a file called members.txt. This file had about 75,000 lines.

The part I was interested in was the last three lines, or more particularly, the actual date joined. I figured that if I had all the dates joined, I could create the required histogram.

The easiest way to do this was to use grep to find all the lines containing joined_at and then use the context option (-A 1) to show the line following each match. This gave:


$ grep -A 1 joined_at members.txt
<td class="joined_at">
    Nov 24, 2010
--
<td class="joined_at">
    Nov 24, 2010
--
<td class="joined_at">
    Nov 28, 2010
...

To clean up the output, I used grep -v to remove the joined_at lines and the -- separators, which gave:

$ grep -A 1 joined_at members.txt | grep -v joined | grep -v -- -- 
    Nov 24, 2010
    Nov 24, 2010
    Nov 28, 2010
...

I was not interested in the day on which people joined, only the month and year. Also, I wanted it in the format year month. This was easily accomplished using AWK:

awk '{print $3, $1}' 

In other words, print the third and first fields of the output.

We now have:

... | awk '{print $3, $1}'
2010 Nov
2010 Nov
2010 Nov
2010 Nov
2009 Oct
...

As you will notice, the data is not necessarily in sorted order. In order to sort numerically, we need the month as a number rather than a name. Converting 'Nov' to '11' is easily done using sed. I won't type out the full command, but it looks like this:

sed 's/Jan/01/;s/Feb/02/;s/Mar/03/;...'

So, we now have:

2010 11
2010 11
2010 11
2009 10
...

All that is left to do is to sort the output numerically (sort -n) and then count the occurrences of each distinct line (uniq -c).

Doing this gives us:

   9 2009 10
   1 2010 10
  64 2010 11
 112 2010 12
 403 2011 01
  60 2011 02
  55 2011 03
  23 2011 04
  33 2011 05
  36 2011 06
  18 2011 07
  60 2011 08
  31 2011 09
  42 2011 10
  22 2011 11
  21 2011 12
  22 2012 01
  23 2012 02
  10 2012 03
  40 2012 04

This gives us a histogram of the number of people who joined each month since 2009. There is an interesting network effect in the data: as soon as the site grew past a critical point, there was an explosion of new members (which took membership to just under half of the organisation), after which the rate of new sign-ups slowed and became almost constant.

However, what I wanted to show in this post is that with a basic knowledge of Unix tools, it is possible to do some reasonably advanced analysis of what may initially seem to be quite unstructured data.

Saturday, March 31, 2012

Let's tawk about AWK

Recently I was at a Python conference and I listened to a speaker tell us that he used to use Fortran, C, Lisp, AWK and a few other languages, but now that he uses Python, he no longer needs to use those other languages.

This surprised me. Although I use Python whenever I can and really enjoy using it, I don't always think it is the best or most suitable language for every occasion. In particular, it was AWK that made me think, "Why would someone stop using AWK and use Python instead?"

I have been meaning to write a short AWK tutorial for months, and recently a conversation with a coworker, whom I regard as a good programmer but who had not heard of AWK, finally inspired me to start writing it.

So, in case you've never heard of AWK, what is it? It is a small language that has been around since 1977 and was part of Version 7 Unix.

It is very simple. An AWK program is generally a series of lines, each of which has the following format:

expression { statement }
expression { statement }
...

AWK is like grep in that it looks at each line of a file and processes the file on a line by line basis. However, it will be much easier to demonstrate than explain.

Suppose we have a file with the following contents:

...
9021 2000-12-27 0 0
9021 2000-12-28 0 0
9021 2000-12-29 .6 0
9021 2000-12-30 0 0
9021 2000-12-31 0 0
86071 2000-01-01 .4 0
86071 2000-01-02 0 0
86071 2000-01-03 0 0
86071 2000-01-04 3.4 0
...

The first column (affectionately known as $1 in AWK) is the identifier for a weather station. The second column ($2) is obviously a date, the third column is the amount of rain that fell in the previous 24 hours, and the fourth column is a quality flag.

Suppose we wanted to print all the records, but only for station 9021. We could do it with the following program:

awk '$1 == 9021 {print $0}'  rain_data.txt
This will give us:
9021 2000-01-01 0 0
9021 2000-01-02 0 0
9021 2000-01-03 0 0
9021 2000-01-04 0 0
9021 2000-01-05 0 0
9021 2000-01-06 0 0
...

In the above example, the expression was '$1 == 9021' and the statement was '{print $0}'. $0 is AWK shorthand for "the whole line".

If we have an expression and we don't have a statement, then the default action is to print the whole line, so we can shorten the above to:

awk '$1 == 9021'  rain_data.txt

Also, we should note that the default field separator is whitespace (it can be one or more spaces or tabs). Now, suppose we wanted to display records for days that actually had some rainfall, we could write:

awk '$3 > 0.0' rain_data.txt
This will give:
9021 2000-01-12 .8 0
9021 2000-01-15 13 0
9021 2000-01-16 5 0
9021 2000-01-17 4.6 0
...
Now suppose we want days where the rainfall was above 10mm and the station was 86071. We can combine the two conditions:
awk '$3 > 10.0 && $1 == 86071' rain_data.txt
This will list only the records for station 86071 on which more than 10mm of rain fell.

We have looked at expressions, so now let's look at statements. Often we want to print only a subset of the fields. Let's print out only the first three fields (in other words, omit the quality field). Since we are not printing the quality field, we will select only records where the quality is good (equal to zero).

awk '$3 > 10.0 && $4 == 0 {print $1, $2, $3}' rain_data.txt
This gives:
9021 2000-01-15 13
9021 2000-01-22 60.8
9021 2000-01-23 14.4
9021 2000-03-22 24
...
Now, we will print the rainfall in inches (mm / 25.4)
awk '$3 > 10.0 && $4 == 0 {print $1, $2, $3/25.4}' rain_data.txt
Which gives
9021 2000-01-15 0.511811
9021 2000-01-22 2.3937
9021 2000-01-23 0.566929
9021 2000-03-22 0.944882
...

There are two special expressions in AWK, 'BEGIN' and 'END'. BEGIN matches before the first line of a file has been read and END matches after the last line of text has been read. These are often used for initialisation or printing totals.

As soon as we start doing anything more complex than a single line, it makes sense to create a small script, for example prog.awk, which we then run as follows:

awk -f prog.awk datafile

For our first script, we want to find the maximum rainfall of all the values in our file. We create the following script (maxrain.awk):

    { if ($3 > max) {max = $3} }
END { print max }

The first line does not have an expression, so it matches all records. The if statement says that if the value in the third column (the rainfall) is greater than the current value of max, replace it. As mentioned above, 'END' matches after all lines have been processed, and at that point we print out the value of max.

We will look at one final aspect of AWK which is associative arrays. An associative array is an array where the index does not have to be an integer (although in this case it will be). We also don't have to declare the size of the array as we do in many other languages.

In this example, we will find the maximum for each of the stations. We do this by creating an array of maximum values where the station number is the index of the array. So, for example, we have station number 86071 and station number 9021. maxarray[86071] will contain the maximum value that we have found so far for that station and maxarray[9021] will contain the maximum value for station 9021.

    { if ($3 > maxarray[$1]) {maxarray[$1] = $3} }
END { for (station in maxarray) print station, maxarray[station] }
Running this we get the following output:
9021 60.8
86071 38.4
...

AWK is a small but very powerful language. It is not a general-purpose language like Python with a rich library, but I sometimes use it many times a day, because the things it does well, it does very well. It is also very simple and concise: I rarely need to look up any of AWK's functions, and most programs are short enough to be typed straight in at the command line. Any software engineer should have many tools in their toolbelt and choose the best tool for the job. It is surprising how often a little language like AWK turns out to be that tool.

Friday, March 30, 2012

Quicksort in Python

I have been brushing up on my algorithms. One of the algorithms that I looked at was Quicksort, a sorting algorithm developed by Tony Hoare in 1960. The C standard library's sorting function, qsort, even takes its name from it, and many implementations of qsort use a variant of quicksort.

Quicksort is one of those algorithms you study in university and then often forget. However, as I mention, I was brushing up on my algorithms and so decided to revisit quicksort. The way I usually understand algorithms is to first look at the algorithm and then look at an implementation.

Here is a quick description of the algorithm:

Quicksort:
  1. Take an array to be sorted.
  2. If the array has fewer than two values, we will consider it sorted (return).
  3. Choose a value which we will call the pivot value. This can be the first element of the array.
  4. Rearrange the array so that all values less than the pivot value are placed before the pivot value and all other values are placed after the pivot value.
  5. Run Quicksort over each of the two subarrays created in the previous step.
Here is an implementation in C (from http://rosettacode.org/wiki/Quicksort):

void quick_sort (int *a, int n) {
    if (n < 2)
        return;
    int p = a[n / 2];
    int *l = a;
    int *r = a + n - 1;
    while (l <= r) {
        while (*l < p)
            l++;
        while (*r > p)
            r--;
        if (l <= r) {
            int t = *l;
            *l++ = *r;
            *r-- = t;
        }
    }
    quick_sort(a, r - a + 1);
    quick_sort(l, a + n - l);
}
 
int main () {
    int a[] = {4, 65, 2, -31, 0, 99, 2, 83, 782, 1};
    int n = sizeof a / sizeof a[0];
    quick_sort(a, n);
    return 0;
}

While we can see what the above code is doing, and we can map the variable names to the description from the algorithm, it is not immediately clear how the algorithm works.

In fact, if we replaced the name quick_sort by q, and asked someone what the code was doing, it would be reasonably hard for most people to guess that it was a quicksort.

The next thing I did was to look for a Python implementation and found this one (http://en.literateprograms.org/Quicksort_(Python)?printable=yes):

def qsort1(list):
    """
    Quicksort using list comprehensions
    """
    if list == []: 
        return []
    else:
        pivot = list[0]
        lesser = qsort1([x for x in list[1:] if x < pivot])
        greater = qsort1([x for x in list[1:] if x >= pivot])
        return lesser + [pivot] + greater

Immediately it is obvious what the algorithm is doing. Looking a little more closely, though, we notice that the first line of the function:

if list == []

checks for an empty list. This is because an empty list can be assumed to be sorted. However, a list with only one element can also be assumed to be sorted, so I changed the lines to:

if len(list) <= 1: 
    return list
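
As a quick check that the modified function still sorts correctly, we can run it on the same numbers as the C example above:

print(qsort1([4, 65, 2, -31, 0, 99, 2, 83, 782, 1]))
# prints [-31, 0, 1, 2, 2, 4, 65, 83, 99, 782]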

I measured the speed of the two algorithms and the second gave me a 10% improvement in speed. Would I have noticed this so easily in the C code? I am not sure. However, having noticed this in the Python code, it seemed fairly obvious that it could be improved and I also felt confident that my improvement was going to work.
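
The exact figure will of course depend on the data and the machine. A rough timing sketch along the following lines (the function names and test data here are just for illustration, not the original benchmark) is enough to compare the two base cases:

import random
import timeit

# The original version, with the empty-list base case.
def qsort_empty(lst):
    if lst == []:
        return []
    pivot = lst[0]
    return (qsort_empty([x for x in lst[1:] if x < pivot])
            + [pivot]
            + qsort_empty([x for x in lst[1:] if x >= pivot]))

# The modified version, which also treats a one-element list as sorted.
def qsort_len(lst):
    if len(lst) <= 1:
        return lst
    pivot = lst[0]
    return (qsort_len([x for x in lst[1:] if x < pivot])
            + [pivot]
            + qsort_len([x for x in lst[1:] if x >= pivot]))

data = [random.randint(0, 100000) for _ in range(5000)]

for fn in (qsort_empty, qsort_len):
    print('%s: %.3f seconds' % (fn.__name__,
                                timeit.timeit(lambda: fn(data), number=20)))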

So, even though Python may not be as efficient as C or some compiled languages, the clarity of the code certainly allows us to make improvements and optimisations that may otherwise have been very difficult.

Saturday, July 30, 2011

Secret Santa and Derangements

Here's an interesting question. Suppose you're holding a Secret Santa - you write down the names of your friends and yourself and put all the names in a hat. Each person pulls a name out of the hat and they must buy a present for the person whose name they pull out. What is the chance that someone will get their own name?

I recently discovered that if we permute an ordered set in such a way that no element is in its original position, this is called a derangement. So, we can rephrase the question in the first paragraph as: what is the chance of a derangement, that is, the chance that nobody pulls out their own name? (The chance that at least one person does get their own name is just one minus this.)

If we have one name in our hat, then the chance of a derangement is 0. If we have two names, the chance is 50%. If we have 3 names, then there are 6 permutations; of these, 2 are derangements, so we have a 1/3 chance of a derangement. As we add more names, things get a little complicated. We should be able to use mathematics to solve this, but since I am more adept at writing software than at mathematics, I thought I would write a simulation in Python to estimate the probability and then ask my friend and excellent mathematician, Alasdair, to do the mathematical analysis. So, without further ado, here is a Python program which simulates a Secret Santa for 10, 20, 30, 40, 50 and 100 people and runs each simulation 10,000 times.

'''
Write a Monte Carlo simulation to estimate the chance
of a derangement.
'''

import random

def event(n):
    '''
    Given n people, create an array of n elements
    shuffle the array and then return True if it's
    a derangement, else False.
    '''
    a = [i for i in range(n)]
    random.shuffle(a)

    for i in xrange(n):
        if i == a[i]:
            return False
    return True

simulations = 10000
for people in [10, 20, 30, 40, 50, 100]:
    derangements = 0
    for i in range(simulations):
        if event(people):
            derangements += 1

    print 'People: %d  Derangements: %f' % (
        people, float(derangements) / simulations)


Running this program, we get something like the following (the results differ slightly each time we run it):
People: 10  Derangements: 0.371900
People: 20  Derangements: 0.370300
People: 30  Derangements: 0.370500
People: 40  Derangements: 0.368200
People: 50  Derangements: 0.360000
People: 100  Derangements: 0.372100

It is hard to tell from these results whether we are converging to a value. Ideally we should be running about a million events for each simulation, and using maybe a million people (we are trying to find out where the series converges as the number of people approaches infinity). I am sure that there will be some very simple value for the final answer. So, over to Alasdair...

To look at the mathematics behind derangements, let's call the number of derangements of n objects D(n). A bit of fiddling will show that D(2)=1, D(3)=2 and D(4)=9. For example, to show that D(4)=9, you can list all the permutations of 4 objects and find which of those are derangements (marked here with an asterisk):

1 2 3 4
1 2 4 3
1 3 2 4
1 3 4 2
1 4 2 3
1 4 3 2
2 1 3 4
2 1 4 3 *
2 3 1 4
2 3 4 1 *
2 4 1 3 *
2 4 3 1
3 2 1 4
3 2 4 1
3 1 4 2 *
3 1 2 4
3 4 1 2 *
3 4 2 1 *
4 1 2 3 *
4 1 3 2
4 2 1 3
4 2 3 1
4 3 1 2 *
4 3 2 1 *

If you like a hard time, then you can find that D(5)=44. Beyond that the amount of work becomes too big for this sort of brute force calculation.

So then, without listing all permutations, how could we go about finding D(5)? Well, start with five objects


1 2 3 4 5


Clearly we can't keep any object in its place, so in particular the number 1 has to move. One possibility is that 1 simply swaps places with another number; for example, we could swap 1 and 2:


2 1 * * *


and then there are D(3)=2 derangements of the remaining three numbers. The number 1 could equally well have been swapped with 3, 4 or 5:

3 * 1 * *
4 * * 1 *
5 * * * 1

But swaps are not the only possibility: all we really know about a derangement is that the first place is taken by one of 2, 3, 4 or 5, and in each case the number of derangements with that value in the first place can be counted. How might we count the number of derangements in the case where, say, 4 is in the first place? Like this:


4 * * * *


We consider two cases:
  1. The 4 and the 1 are swapped: 4 * * 1 *
  2. The 1 goes into some other place, such as 4 * 1 * *

Now to count the number of derangements in each case.
  1. The other numbers 2, 3 and 5 must be rearranged among their own three places, with none of them staying put. So in this case the number of possibilities is the number of derangements of three objects: D(3)=2.
  2. Now suppose we count the derangements where the 1 does not go into the fourth place. To picture this, write down the original order and the order with 1 and 4 swapped:


    Original: 1 2 3 4 5
    After the swap: 4 2 3 1 5


    Forgetting about the 4 in the first place, we now need each of the remaining four numbers 2, 3, 1, 5 to avoid the place it currently occupies; in particular, this ensures that 1 does not end up in the fourth place. The number of ways of doing this is just the number of derangements of four objects: D(4)=9.

What we have shown now is that the number of derangements with 4 in the first place is D(3)+D(4)=2+9=11. But there are four possible values for the first place (every number other than 1), so the total number of derangements is 4 x (D(3)+D(4)) = 4 x 11 =44. And this is the value of D(5). Applying this argument to the general case leads to the recursive formula



$$D(n)=(n-1)\bigl(D(n-1)+D(n-2)\bigr)$$


The next few values are easily determined: D(6)=5(D(4)+D(5))=5(9+44)=265. Then D(7)=6(D(5)+D(6))=6(44+265)=1854. And so on.
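
As a quick aside, the recurrence is easily checked with a few lines of Python (the function below is just a sketch for that purpose):

def derangements(n):
    # D(n) = (n - 1) * (D(n - 1) + D(n - 2)), with D(0) = 1 and D(1) = 0.
    d_prev, d = 1, 0
    for i in range(2, n + 1):
        d_prev, d = d, (i - 1) * (d + d_prev)
    return d if n > 0 else 1

print([derangements(n) for n in range(2, 8)])
# prints [1, 2, 9, 44, 265, 1854]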



The question now is to determine a non-recursive formula. And for this we are going to use the inclusion-exclusion principle, which you may have seen in relation to the sizes of set intersections and unions. For two sets, it says that: $$|A\cup B|=|A|+|B|-|A\cap B|$$ That is, the number of elements in a set union is the sum of the set cardinalities, minus the cardinality of the set intersection. The reason for subtracting the set intersection is that when the cardinalities of the two sets are added, each element in the intersection is counted twice.

For three sets, we have $$|A\cup B\cup C|=|A|+|B|+|C|-|A\cap B|-|A\cap C|-|B\cap C|+|A\cap B\cap C|$$ So first the (cardinalities of the) sets are added; then the intersections of all pairs of sets are subtracted, and finally the intersection of all three sets is added back. And this principle can be extended to an arbitrary number of sets. First add the cardinalities of all the sets, then subtract the cardinalities of all intersections of pairs of sets, then add the cardinalities of all intersections of triples of sets, then subtract the cardinalities of all intersections of quadruples of sets, then add... you get the idea!

What does this have to do with derangements? Well, consider all the permutations of four objects, and let $$A$$ be the set in which 1 is in its correct place, $$B$$ the set for which 2 is in the correct place, and $$C$$ and $$D$$ the sets with 3 and 4 in the correct place respectively. We can now use the inclusion-exclusion principle to determine the number of permutations where at least one element is in the correct place: that is, the size of the union of $$A$$, $$B$$, $$C$$ and $$D$$. Now the size of each individual set is $$3!$$ (one element is fixed, leaving three to arrange freely), and there are four such sets. The size of each intersection of two sets is $$2!$$ (two elements are fixed, leaving two to move), and there are

$${4\choose 2}=6$$
possible pairs. The size of each intersection of three sets is $$1!$$ (three elements are fixed, leaving only one to move), and there are

$${4\choose 3}=4$$
possible triples. Thus we have

$$|A\cup B\cup C\cup D| = 4(3!)-{4\choose 2}2!+{4\choose 3}1!-{4\choose 4}0!$$
Writing each binomial coefficient in terms of factorials and simplifying, this sum becomes

$$4!-\frac{4!}{2!}+\frac{4!}{3!}-\frac{4!}{4!}$$
or as

$$4!(\frac{1}{1!}-\frac{1}{2!}+\frac{1}{3!}-\frac{1}{4!})$$.
But this value (call it $$v$$) is the number of permutations in which at least one element is fixed. The number of derangements is $$4!-v$$, or

$$4!(1-\frac{1}{1!}+\frac{1}{2!}-\frac{1}{3!}+\frac{1}{4!})$$.
In general, then:

$$D(n)=n!(1-\frac{1}{1!}+\frac{1}{2!}-\frac{1}{3!}+\frac{1}{4!}-\cdots +(-1)^n\frac{1}{n!}).$$
If we divide both sides by $$n!$$, and take the limit as n rockets off to infinity, we have

$$\lim_{n\to\infty}\frac{D(n)}{n!}=1-\frac{1}{1!}+\frac{1}{2!}-\frac{1}{3!}+\frac{1}{4!}-\cdots$$
and the right hand side can be seen to be $$e^{-1}\approx 0.36787944$$, which is what was obtained above by experiment.
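
As a final check, the closed form can be evaluated directly (again, just a quick sketch); the ratio D(n)/n! settles down to the 0.37 or so that the simulation produced:

import math

def derangements_closed_form(n):
    # D(n) = n! * (1 - 1/1! + 1/2! - ... + (-1)^n / n!), using exact integers.
    return sum((-1) ** k * math.factorial(n) // math.factorial(k)
               for k in range(n + 1))

for n in [5, 10, 20]:
    ratio = float(derangements_closed_form(n)) / math.factorial(n)
    print('n = %3d  D(n)/n! = %.8f' % (n, ratio))

print('1/e      = %.8f' % math.exp(-1))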

Saturday, June 4, 2011

Sorting without sorting

It is Sunday, the children have gone to visit their grandparents and I should be writing an essay this afternoon for my Graduate Certificate in management.

Instead I was browsing the web and came across a Kata by Dave Thomas from Pragmatic
Programmers.

The idea of Katas is that they are exercises that allow programmers to keep
their skills in tune by continually solving small programming (or sometimes
other) problems. In this particular kata, Dave offers the following problem:

Our resident conspiracy expert, Dr. X, is looking for hidden messages in the
collected publications of Hugh Hefner. Dr. X believes the message is hidden in
the individual letters, so rather than get distracted by the words, he’s asked
us to write a program to take a block of text and return the letters it
contains, sorted. Given the text:

When not studying nuclear physics, Bambi likes to play
beach volleyball.


our program would return:

aaaaabbbbcccdeeeeeghhhiiiiklllllllmnnnnooopprsssstttuuvwyyyy


The program ignores punctuation, and maps upper case to lower case.

Are there any ways to perform this sort cheaply, and without using built-in libraries?


I decided to quickly try a solution in Python and came up with the following:

sentence = '''
   When not studying nuclear physics, Bambi likes to play
   beach volleyball.
'''

frequency = [0] * 26

def increment_frequency(letter):
    frequency[ord(letter) - ord('a')] += 1

[increment_frequency(letter) for letter in sentence.lower() if letter >= 'a' 
    and letter <= 'z']

print ''.join([chr(i + ord('a')) * frequency[i] for i in
    range(len(frequency))])


What I have done is to create an array of 26 elements, one for each letter of the alphabet, and each time I see a letter, I increment the count for that letter. In other words, we don't sort the letters, we just keep track of the frequency of each letter. Finally, we print out the contents of the array as a histogram, using the letter to represent the bar.
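
As a sanity check (not part of the kata solution itself), we can append a couple of lines to the script above and compare the result against Python's built-in sort:

# The counting approach should agree with simply sorting the filtered letters.
letters = [c for c in sentence.lower() if 'a' <= c <= 'z']
counted = ''.join([chr(i + ord('a')) * frequency[i]
                   for i in range(len(frequency))])
print counted == ''.join(sorted(letters))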

I can't claim to have come up with the solution myself - I am sure that I have seen it before, probably more than once. But I think doing the katas is a good way to keep that sort of thing at the front of one's mind. Also, while doing it, I made the mistake that I often make of writing iter.join(str) rather than str.join(iter).

Okay, now on to my essay...

Monday, April 11, 2011

Miller-Rabin to the rescue

I love words and I love primes, so when I came across this article on prime words, naturally I was tempted to try to implement it in Python.

In case you can't click through to the article, here is what we want to do. In base 36, every letter of the alphabet represents a digit. So every word is a number. The digit A has the value 10, B has the value 11 and D has the value 13, so BAD would represent 11*(36^2) + 10*(36^1) + 13*(36^0).
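
In Python the conversion is simply int(word, 36); here is a quick check of that arithmetic:

>>> 11*36**2 + 10*36 + 13
14629
>>> int('bad', 36)
14629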

This value may or may not be prime. Our program will read all the words in a dictionary and print out the words which have a prime value.

Here is my first (naive) attempt:

#
# Read a dictionary and print out all words which are prime base 36.
#

import math

def isprime(num):
    '''
    Return true if a number is prime, else false.
    '''

    # most of the numbers will be tested by the next four lines
    if num == 2 or num == 3 or num == 5:
        return True
    if num % 2 == 0 or num % 3 == 0 or num % 5 == 0:
        return False
    
    # For the remainder...
    n = 5
    while (n <= math.sqrt(num)):
        if num % n == 0:
            return False
        n += 2
    return True


for word in open('/usr/share/dict/words'):
    if isprime(int(word.rstrip().upper(), 36)):
        print(word)

So, what is wrong with it? Well, I started running it and I started to see a few words, for example:
Ababdeh
abacist
abandon
But then things slowed down. As soon as I thought about it for a few seconds, I realised that some words probably have quite high integer values. For example, let's take ankyloblepharon (the condition where your eyelids are stuck together) and convert it to an integer in base 36:

>>> int('ankyloblepharon', 36)
65432123906951596638119L

Okay, that's a pretty big number. Fortunately we only test half the numbers in the range. That's still pretty big. Even more fortunately, we only test up to the square root of the number, which gives us:

>>>  math.sqrt(int('ankyloblepharon', 36) /2)
180875819150.80798

Adding a few commas, we get 180,875,819,150 or, to put it more simply, about 180 billion tests. Clearly we need something more efficient.

Luckily there is a test for primality called Miller-Rabin, which is a "probabilistic" test. We test a number to see if it is prime: if the test tells us the number is composite, we are sure it is composite; if the test tells us the number is prime, we are pretty sure it is prime. Depending on the parameter we use, "pretty sure" can mean something like 99.999999999999% sure. I used the Wikipedia description of Miller-Rabin. Here is the revised code:
#
# Read a dictionary and print out all words which are prime base 36.
#

import math


#
# Implementation of Miller-Rabin based on Wikipedia algorithm
#
# Algorithm:
# Input: n > 3, an odd integer to be tested for primality;
# Input: k, a parameter that determines the accuracy of the test
# Output: composite if n is composite, otherwise probably prime
# write n - 1 as 2^s*d with d odd by factoring powers of 2 from n - 1
# LOOP: repeat k times:
#   pick a random integer a in the range [2, n - 2]
#   x <- a^d mod n
#   if x = 1 or x = n - 1 then do next LOOP
#   for r = 1 .. s - 1
#      x <- x^2 mod n
#      if x = 1 then return composite
#      if x = n - 1 then do next LOOP
#   return composite
# return probably prime
# 
#
import random

def inner_test(n, s, x):
    for r in range(1, s):
        x = pow(x, 2, n)
        if x == 1:
            return False
        if x == n - 1:
            return True 
    return False 


def miller_rabin(n):
    if n < 2:
        return False

    if n % 2 == 0:
        return n == 2

    k = 30 # Number of tests
    s = 0 
    d = n - 1
    while d % 2 == 0:
        s += 1
        d = d / 2
    for i in range(k):
        a = random.randint(2, n - 1)
        x = pow(a, d, n) 
       
        if x == 1 or x == n - 1:
            continue
        
        if not inner_test(n, s, x):
            return False
    return True



for word in open('/usr/share/dict/words'):
    if miller_rabin(int(word.rstrip().upper(), 36)):
        print(word)


So, how much better was the revised code? Well, in less time than it took to print the first four words using the first test, the revised test found 12,000+ words.

Interestingly, there is also a deterministic version of Miller-Rabin. According to the Wikipedia article, we can test numbers up to 341,550,071,728,321 using only the following numbers: 2, 3, 5, 7, 11, 13 and 17. I have a feeling that limit should cover most of the words in our dictionary, but that is a task for another day. I think a speed up of about 4,000 times is sufficient for one day.
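
Although I will leave running it over the whole dictionary for that other day, here is a rough sketch of what the deterministic variant could look like, simply reusing the same decomposition of n - 1 as 2^s * d but with the fixed witnesses instead of random ones (the function name is just for illustration):

def is_prime_deterministic(n):
    # Deterministic Miller-Rabin, valid for n < 341,550,071,728,321,
    # using the fixed witnesses 2, 3, 5, 7, 11, 13 and 17.
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2

    # Write n - 1 as 2^s * d with d odd.
    s, d = 0, n - 1
    while d % 2 == 0:
        s += 1
        d //= 2

    for a in [2, 3, 5, 7, 11, 13, 17]:
        if a % n == 0:
            continue            # only happens for very small n
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue
        for r in range(1, s):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False        # a is a witness that n is composite
    return True

This would only be valid for words whose base 36 value stays below that bound (which works out to words of nine letters or fewer), so the probabilistic version above remains the safer general-purpose choice.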

Tuesday, March 1, 2011

When programming languages are being discussed on forums, one of the topics that almost always comes up is the kind of community in which the language exists (and I am using the idea of the language existing within a community quite deliberately). I have found that many languages have quite a vibrant community where people are willing to share thoughts and ideas and are supportive of those who are learning the language. A few weeks ago, though, I was lucky to see first hand the excellent spirit of the Python community.

David Beazley, Python expert, author of SWIG and the Python Essential Reference, and all round nice guy, was travelling to Canberra, Australia to do some training. He offered, at no expense, to visit Melbourne and give us a sneak preview of the workshop that he will be delivering at PyCon 2011. The workshop was pretty well attended for one held on a Saturday at very short notice.

The workshop was titled 'Mastering Python 3 I/O', and my first response was: how much can we talk about I/O in Python 3 that I don't already know? After all, we all know that 'print x' has been replaced by 'print(x)'. Well, as it turns out, there was a lot that I don't know. Of the almost 150 slides that we covered, I found myself taking a note or two on almost every second or third slide. The talk was broken down into 8 easily digestible parts. It started with simple issues like porting from 2 to 3 (and why 2to3 cannot cope with many I/O issues), progressed to a very nice overview of Unicode (an understanding of which is important in understanding Python 3) and continued through to system interfaces and library design issues.

Here are just a few of my dot points from the talk (I am tempted to include more, but do not want to preempt Dave's talk at PyCon):

* All strings are Unicode strings in Python 3.
* Most of the Unicode issues are handled transparently.
* 2to3 will not address many I/O issues in migration.
* Python 3 reimplements most of the I/O stack.
* Python 3 has introduced a new formatting model.
* Much of this model has been backported to Python 2 (in particular, to 2.7).
* There are potentially lots of gotchas in Python 3. Forewarned is forearmed.

I think it is a bit unfortunate that Python 3 is mentioned in the title. This talk is relevant to all Python programmers - even if you don't intend using Python 3 for a while. If you're attending PyCon and have a chance to attend Dave's workshop, I can recommend it. If you don't get a chance, hopefully the slides will be available. Finally, thanks Dave. You're a credit to the Python community and all of us in Melbourne enjoyed your breezy Chicago style.

Note: After the conference, the slides should be available here: http://www.dabeaz.com/talks.html