mudeltasigma: 2011

Saturday, July 30, 2011

Secret Santa and Derangements

Here's an interesting question. Suppose you're holding a Secret Santa - you write down the names of your friends and yourself and put all the names in a hat. Each person pulls a name out of the hat and they must buy a present for the person whose name they pull out. What is the chance that someone will get their own name?

I recently discovered that if we permute an ordered set in such a way that no element is in its original position, this is called a derangement. So, we can rephrase the question in the first paragraph as what is the chance of a derangement.

If we have one name in our hat, then the chance of a derangement is 0. If we have two names, the chances are 50%. If we have 3 names, then there are 6 permutations. Of these, 4 of them are derangements, so we have a 2/3 chance of a derangement. As we have more names, things get a little complicated. We should be able to use mathematics to solve this, but since I am more adept with writing software than with mathematics, I thought I would write a simulation in Python to calculate the probability and then ask my friend and excellent mathematician, Alasdair, to do the mathematical analysis. So, without further ado, here is a Python program which simulates a Secret Santa for 10, 20, 30, 40, 50 and 100 people and runs
each simulation 10,000 times.

'''
Write a Monte Carlo simulation to estimate the chance
of a derangement.
'''

import random

def event(n):
    '''
    Given n people, create an array of n elements
    shuffle the array and then return True if it's
    a derangement, else False.
    '''
    a = [i for i in range(n)]
    random.shuffle(a)

    for i in xrange(n):
        if i == a[i]:
            return False
    return True

simulations = 10000
for people in [10, 20, 30, 40, 50, 100]:
    derangements = 0
    for i in range(simulations):
        if event(people):
            derangements += 1

print 'People: %d  Derangements: %f' % (
    people, float(derangements) / simulations)

Running this program, we get (maybe, because we get different results each time we run it):

People: 10  Derangements: 0.371900
People: 20  Derangements: 0.370300
People: 30  Derangements: 0.370500
People: 40  Derangements: 0.368200
People: 50  Derangements: 0.360000
People: 100  Derangements: 0.372100

It is hard to tell from these results whether we are converging to a value. Ideally we should be doing about a million events for each simulation and using maybe a million people (we are trying to find out where the series converges when the amount of people approaches infinity). I am sure that there will be some very simple value for the final answer. So, over to Alasdair...

To look at the mathematics behind derangements, lets call the number of derangements of n objects D(n). A bit of fiddling will show that D(2)=1, D(3)=2 and D(4)=9. For example, to show that D(4)=9, you can list all the permutations of 4 objects, and find which of those are derangements:

1	2	3	4
1	2	4	3
1	3	2	4
1	3	4	2
1	4	2	3
1	4	3	2
2	1	3	4
2	1	4	3	*
2	3	1	4
2	3	4	1	*
2	4	1	3	*
2	4	3	1
3	2	1	4
3	2	4	1
3	1	4	2	*
3	1	2	4
3	4	1	2	*
3	4	2	1	*
4	1	2	3	*
4	1	3	2
4	2	1	3
4	2	3	1
4	3	1	2	*
4	3	2	1	*

If you like a hard time, then you can find that D(5)=44. Beyond that the amount of work becomes too big for this sort of brute force calculation.

So then, without listing all permutations, how could we go about finding D(5)? Well, start with five objects

1 2 3 4 5

Clearly we can't keep just one object in place and permute the other four. So we have to change the positions of at least two objects, and permute the other three. For example, we could swap 1 and 2:

2 1 * * *

and then there are two derangements of the other three. And in fact this gives as a method of counting all derangements. The number 1 can also swapped with three other numbers:

3	1	*	*	*
4	1	*	*	*
5	1	*	*	*

and in each case the derangements of the rest can be counted. How might we count the number of derangements in the case where say, 4 is in the first place? Like this:

4 * * * *

We consider two cases:

The 4 and the 1 are swapped: 4 * * 1 *
The 1 goes into some other place, such as 4 * 1 * *

Now to count the number of derangements in each case.

The other numbers 2, 3, and 5, have not been moved. So in this case the total number of derangements is the number of derangements of three objects: D(3)=2.
Suppose we try to count the number of derangements now where the 1 is not in the fourth place. This means we have:

Original: 1 2 3 4 5

After the swap: 4 2 3 1 5

Forgetting about the initial four, the number of derangements is the number of derangements of the other four: this will ensure that 1 does not end up in the fourth place. And this number is D(4)=9.

What we have shown now is that the number of derangements with 4 in the first place is D(3)+D(4)=2+9=11. But there are four possible values for the first place (every number other than 1), so the total number of derangements is 4 x (D(3)+D(4)) = 4 x 11 =44. And this is the value of D(5). Applying this argument to the general case leads to the recursive formula

D(n)=(n-1)(D(n-1)+D(n-2))

The next few values are easily determined: D(6)=5(D(4)+D(5))=5(9+44)=265. Then D(7)=6(D(5)+D(6))=6(44+265)=1854. And so on.

The question now is to determine a non-recursive formula. And for this we are going to use the inclusion-exclusion principle, which you may have seen in relation to the sizes of set intersections and unions. For two sets, it says that: $$|A\cup B|=|A|+|B|-|A\cap B|$$ That is, the number of elements in a set union is the sum of the set cardinalities, minus the cardinality of the set intersection. The reason for subtracting the set intersection is that when the cardinalities of the two sets are added, each element in the union is counted twice.

For three sets, we have $$|A\cup B\cup C|=|A|+B|+|C|-|A\cap B|-|A\cap C|-|B\cap C|+|A\cap B\cap C|$$ So first the (cardinalities of the) sets are added; then the intersections of all pairs of sets subtracted, and finally the intersection of all sets added. And this principle can be extended to an arbitrary number of sets. First add the cardinalities of all the sets, then subtract the cardinalities of all intersections of pairs of sets, then add the cardinalities of all intersections of triples of sets, then subtract the cardinalities of all intersections of quadruples of sets, then add... you get the idea!

What does this have to do with derangements? Well, consider all the derangements of four objects, and let $$A$$ be the set in which 1 is in its correct place, $$B$$ the set for which 2 is in the correct place, and $$C$$ and $$D$$ set sets with 3 and 4 in the correct place respectively. We can now use the inclusion-exclusion principle to determine the number of permutation where at least one element is in the correct place: that is, the size of the union of $$A$$, $$B$$, $$C$$ and $$D$$. Now the size of each individual set is $$3!$$. The size of each intersection of two sets is $$2!$$ (two elements are fixed, leaving two to move), and there are

$${4\choose 2}=6$$
possible pairs. The size of each intersection of three sets is $$1!$$ (three elements are fixed, leaving only one to move), and there are

$${4\choose 3}=4$$
possible triples. Thus we have

$$A\cup B\cup C\cup D = 4(3!)-{4\choose 2}2!+{4\choose 3}1!+{4\choose 4}0!$$.
Writing each binomial coefficient in terms f factorials and simplifying, this sum becomes

$$4!-\frac{4!}{2!}+\frac{4!}{3!}-\frac{4!}{4!}$$
or as

$$4!(\frac{1}{1!}-\frac{1}{2!}+\frac{1}{3!}-\frac{1}{4!})$$.
But this value (call it $$v$$) is the number of permutations in which at least one element is fixed. The number of derangements is $$4!-v$$, or

$$4!(1-\frac{1}{1!}+\frac{1}{2!}-\frac{1}{3!}+\frac{1}{4!})$$.
In general, then:

$$D(n)=n!(1-\frac{1}{1!}+\frac{1}{2!}-\frac{1}{3!}+\frac{1}{4!}-\cdots +(-1)^n\frac{1}{n!}).$$
If we divide both sides by $$n!$$, and take the limit as n rockets off to infinity, we have

$$\lim_{n\to\infty}\frac{D(n)}{n!}=1-\frac{1}{1!}+\frac{1}{2!}-\frac{1}{3!}+\frac{1}{4!}-\cdots$$
and the right hand side can be seen to be $$e^{-1}\approx 0.36787944$$, which is what was obtained above by experiment.

Saturday, June 4, 2011

Sorting without sorting

It is Sunday, the children have gone to visit their grandparents and I should be writing an essay this afternoon for my Graduate Certificate in management.

Instead I was browsing the web and came across a Kata by Dave Thomas from Pragmatic
Programmers.

The idea of Katas is that they are exercises that allow programmers to keep
their skills in tune by continually solving small programming (or sometimes
other) problems. In this particular kata, Dave offers the following problem:

Our resident conspiracy expert, Dr. X, is looking for hidden messages in the
collected publications of Hugh Hefner. Dr. X believes the message is hidden in
the individual letters, so rather than get distracted by the words, he’s asked
us to write a program to take a block of text and return the letters it
contains, sorted. Given the text:

When not studying nuclear physics, Bambi likes to play
beach volleyball.

our program would return:

aaaaabbbbcccdeeeeeghhhiiiiklllllllmnnnnooopprsssstttuuvwyyyy

The program ignores punctuation, and maps upper case to lower case.

Are there any ways to perform this sort cheaply, and without using built-in libraries?

I decided to try a solution quickly in Python and came up with the following
solution:

sentence = '''
   When not studying nuclear physics, Bambi likes to play
   beach volleyball."
'''

frequency = [0] * 26

def increment_frequency(letter):
    frequency[ord(letter) - ord('a')] += 1

[increment_frequency(letter) for letter in sentence.lower() if letter >= 'a' 
    and letter <= 'z']

print ''.join([chr(i + ord('a')) * frequency[i] for i in
    range(len(frequency))])

What I have done is to create an array of 26 elements, one for each letter of the alphabet, and each time I see a letter, I increment the count for that letter. In other words, we don't sort the letters, we just keep track of the frequency of each letter. Finally, we print out the contents of the array as a histogram, using the letter to represent the bar.

I did not come up with the solution - I am sure that I have seen it before. Probably more than once. But I think doing the katas is a good way to keep that sort of thing in front of the mind. Also, while doing it, I made the mistake that I often make of using iter.join(str), rather than using str.join(iter).

Okay, now on to my essay...

Monday, April 11, 2011

Miller-Rabin to the rescue

I love words and I love primes, so when I came across this article on prime words, naturally I was tempted to try to implement it in Python.

In case you can't click through to the article, here is what we want to do. In base 36 every letter of the alphabet represents a digit. So every word is a number. The digit A has the value 11, B has the value 12 and D has the value 14, so BAD would represent 12*(36^2) + 11*(36^1) + 14*(36^0).

This value may or may not be prime. Our program will read all the words in a dictionary and print out the words which have a prime value.

Here is my first (naive) attempt:

#
# Read a dictionary and print out all words which are prime base 36.
#

import math

def isprime(num):
    '''
    Return true if a number is prime, else false.
    '''

    # most of the numbers will be tested by the next four lines
    if num == 2 or num == 3 or num == 5:
        return True
    if num % 2 == 0 or num % 3 == 0 or num % 5 == 0:
        return False
    
    # For the remainder...
    n = 5
    while (n <= math.sqrt(num)):
        if num % n == 0:
            return False
        n += 2
    return True


for word in open('/usr/share/dict/words'):
    if isprime(int(word.rstrip().upper(), 36)):
        print(word)

So, what is wrong with it? Well, I started running it and I started to see a few words, for example:

Ababdeh
abacist
abandon

But then things slowed down. As soon as I thought about it for a few seconds, I realised that some words probably have quite high integer values. For example, let's take ankyloblepharon (the condition where your eyelids are stuck together). Let's convert this to an integer base 64:

>>> int('ankyloblepharon', 36)
65432123906951596638119L

Okay, that's a pretty big number. Fortunately we only test half the numbers in the range. That's still pretty big. Even more fortunate, we only test up the the square root of the number which gives us:

>>>  math.sqrt(int('ankyloblepharon', 36) /2)
180875819150.80798

Adding a few commas, we get: 180,875,819,150 or to put it more simply, 180 billion tests. Clearly we need something more efficient. Luckily there is a test for primality called Miller-Rabin which is a "probabilistic" test. We test a number to see if it's prime. If the test tells us the number is composite, we are sure it is composite. If the test tells us the number is prime, we are pretty sure it is prime. Depending on the parameter we use, pretty sure can mean something like 99.999999999999% sure. I used the Wikipedia definition of Miller-Rabin. Here is the revised code:

#
# Read a dictionary and print out all words which are prime base 36.
#

import math


#
# Implementation of Miller-Rabin based on Wikipedia algorithm
#
# Algorithm:
# Input: n > 3, an odd integer to be tested for primality;
# Input: k, a parameter that determines the accuracy of the test
# Output: composite if n is composite, otherwise probably prime
# write n - 1 as 2^s*d with d odd by factoring powers of 2 from n - 1
# LOOP: repeat k times:
#   pick a random integer a in the range [2, n - 2]
#   x <- a^d mod n
#   if x = 1 or x = n - 1 then do next LOOP
#   for r = 1 .. s - 1
#      x <- x2 mod n
#      if x = 1 then return composite
#      if x = n - 1 then do next LOOP
#   return composite
#return probably prime
# 
#
import random

def inner_test(n, s, x):
    for r in range(1, s):
        x = pow(x, 2, n)
        if x == 1:
            return False
        if x == n - 1:
            return True 
    return False 


def miller_rabin(n):
    if n < 2:
        return False

    if n % 2 == 0:
        return n == 2

    k = 30 # Number of tests
    s = 0 
    d = n - 1
    while d % 2 == 0:
        s += 1
        d = d / 2
    for i in range(k):
        a = random.randint(2, n - 1)
        x = pow(a, d, n) 
       
        if x == 1 or x == n - 1:
            continue
        
        if not inner_test(n, s, x):
            return False
    return True



for word in open('/usr/share/dict/words'):
    if miller_rabin(int(word.rstrip().upper(), 36)):
        print(word)

So, how much better was the revised code? Well, in less time than it took to print the first four words using the first test, the revised test found 12,000+ words.

Interestingly, there is also a deterministic version of Miller-Rabin. According to the Wikipedia article, we can test numbers up to 341,550,071,728,321 using only the following numbers: 2, 3, 5, 7, 11, 13 and 17. I have a feeling that limit should cover most of the words in our dictionary, but that is a task for another day. I think a speed up of about 4,000 times is sufficient for one day.

Tuesday, March 1, 2011

When programming languages are being discussed on forums, one of the topics that is almost always an issue is the kind of community in which the language exists (and I am using the idea of the language existing within a community quite deliberately). I have found that many languages have quite a vibrant community where people are willing to share thoughts and ideas and are supportive of those who are learning the language. A few weeks ago, though, I was lucky to see first hand the excellent spirit of the Python community. David Beazley, Python expert, author of SWIG, the Python Essential Reference and all round nice guy was travelling to Canberra, Australia to do some training. He offered, at no expense, to visit Melbourne and give us a sneak preview of the workshop that he will be delivering at PyCon2011. The workshop was pretty well attended for a workshop on a Saturday that was held at very short notice. The workshop was titled 'Mastering Python 3 I/O' and my first response was how much can we talk about I/O in Python 3 that I don't already know. After all, we all know that 'print x' has been replaced by 'print(x)'. Well, as it turns out, there was a lot that I don't know. Of the almost 150 slides that we covered, I found myself taking a note or two almost every second or third slide. The talk was broken down into 8 easily digestible parts. It started with simple issues like porting from 2 to 3 (and why 2to3 cannot cope with many I/O issues), progressed to a very nice overview on Unicode (an understanding of which is important in understanding Python 3) and continued through to system interfaces and library design issues. Here are just a few of my dotpoints from the talk (I am tempted to include more, but do not want to preempt Dave's talk at PyCon):

* All strings are Unicode strings in Python 3.
* Most of the Unicode issues are handled transparently.
* 2to3 will not address many I/O issues in migration.
* Python 3 reimplements most of the I/O stack.
* Python 3 has introduced a new formatting model.
* Much of this model has been backported to Python (in particular, 2.7).
* There are potentially lots of gotchas in Python 3. Forewarned is forearmed.

I think it is a bit unfortunate that Python 3 is mentioned in the title. This talk is relevant to all Python programmers - even if you don't intend using Python 3 for a while. If you're attending PyCon and have a chance to attend Dave's workshop, I can recommend it. If you don't get a chance, hopefully the slides will be available. Finally, thanks Dave. You're a credit to the Python community and all of us in Melbourne enjoyed your breezy Chicago style.

Note: After the conference, the slides should be available here: http://www.dabeaz.com/talks.html

Sunday, February 6, 2011

4x4s in Python

I recently came across a nice little game. As far as I can tell, you are required to make as many numbers as possible using only four 4s. So, for example:
4 * 4 / 4 * 4 = 1
(4 / 4) + (4 / 4) = 2
(4 + 4 + 4) / 4 = 3
etc.

I played the game until I got up to about 5 then thought it would be much more rewarding to write a little Python program to see how many numbers under 100 I could generate using only four 4s.

There may be a better way to do this, but my simple approach was to generate all the permutations of four 4s together with the permissible set of operators (ie. +, -, x, / and ^). This gives us 7 characters. However, we also need to consider parentheses since (4 + 4 + 4) / 4 gives a very different result to 4 + 4 + (4 / 4). As soon as we include parentheses, we start getting into 'intractable land'. In other words, the number of permutations that we can generate becomes impractical. As anyone who has ever owned an HP calculator (the real ones manufactured in the 1980s) knows, any parenthesised expression can be expressed in Reverse Polish Notation without parentheses. Very briefly, this means that we use a stack on push operands onto the stack. As soon as we encounter an operator (we will assume we only have binary operators), we pop two values from the stack, apply the operator and push the result back onto the stack. We can now express the previous two examples using RPN (Reverse Polish Notation).

(4 + 4 + 4) / 4 becomes 444++4/

and

4 + 4 + (4 / 4) becomes 4444/++

Fortunately, writing a function to evaluate an RPN expression is straightforward. Here is one implementation.

def eval_rpn(exp):
    stack = []
    for ch in exp:
        if isdigit(ch):
            stack.append(int(ch))
        if isoperator(ch):
            if len(stack) < 2:
                return None
            rhs = stack.pop()
            lhs = stack.pop()
            if ch == '+':
                stack.append(lhs + rhs)
            elif ch == '-':
                stack.append(lhs - rhs)
            elif ch == '*':
                stack.append(lhs * rhs)
            elif ch == '/':
                if rhs == 0:
                    return None
                stack.append(lhs / rhs)
            elif ch == '^':
                stack.append(lhs ** rhs)
    result = stack.pop()
    return result

Once we have this, the rest is easy. We just generate all permutations of the four digits and the allowable operators. We end up with the following program:

'''
A program to evaluate each combination of 4 digits
'''

import itertools

operators = ['+','-','*','/','^']
digits = ['4','4','4','4']


pattern = operators + digits

results = {}

def isdigit(ch):
    return ch in '0123456789'


def isoperator(ch):
    return ch in '+-*/^'

def eval_rpn(exp):
    stack = []
    for ch in exp:
        if isdigit(ch):
            stack.append(int(ch))
        if isoperator(ch):
            if len(stack) < 2:
                return None
            rhs = stack.pop()
            lhs = stack.pop()
            if ch == '+':
                stack.append(lhs + rhs)
            elif ch == '-':
                stack.append(lhs - rhs)
            elif ch == '*':
                stack.append(lhs * rhs)
            elif ch == '/':
                if rhs == 0:
                    return None
                stack.append(lhs / rhs)
            elif ch == '^':
                stack.append(lhs ** rhs)
    result = stack.pop()
    return result


# Try every permutation of 7 characters...
for exp in itertools.permutations(pattern, 7):
    e = ''.join(exp)
    val = eval_rpn(e)
    if val is not None and int(val) == val and val >= 0:
        results[val] = ''.join(e) 

for i in range (1, 100):
    if i in results:
        print results[i], ' = ', i

When we run the program, we get the following output:

4444*/^  =  1
444+4/-  =  2
444/4^-  =  3
4444^/-  =  4
4444-^+  =  5
4444/-+  =  7
4444/^+  =  8
4444/-*  =  12
44*44/-  =  15
4444/^*  =  16
44/44*+  =  17
4444/+*  =  20
444+*4-  =  28
44^44+/  =  32
44^4/4-  =  60
44^4-4/  =  63
4444/-^  =  64
444^+4/  =  65
444^4/+  =  68
444/-4^  =  81

After doing this, I noticed that we had more gaps than I expected. I reread the rules of the game and realised that an expression like 44 + 44 is also valid. However, I don't think it's a hard feature to add (or is it?). If I was going to spend a little more time on this program, it would be to present the results in infix notation. Maybe for the moment I will leave that as an exercise for the reader.

Wednesday, January 26, 2011

Handling dates with regular expressions

I have a confession to make. Part of the reason I have created this blog is as a memo to myself for little features that I have come across while using Python. I find myself being faced again and again with the same sort of problem and then I cannot remember in which script I last used it, so I go looking through all my scripts. There is another little confession; I used to lecture and enjoyed it. I am hoping that occasionally someone interested in learning Python will stumble across this blog and find something useful in it. Maybe even hoping that someone not interested in Python will stumble across this blog and become interested in Python.

Anyway, this morning I came across a request from a colleague who wanted to be able to ingest dates - but dates come in different formats, eg. 15/04/1707 and 1777-04-30 and even 16430104.

One way to treat these dates would be to split them using the split function and then try to work out which part of the date is the year, the month and the day. Another method is to use regular expressions - and that is the topic of this post.

I had some simple code which I developed a while back to attack exactly this type of problem. This is the code:

import re
date_format_yyyy_mm_dd = re.compile('^\d\d\d\d-\d\d-\d\d$')
date_format_dd_mm_yyyy = re.compile('^\d\d/\d\d/\d\d\d\d$')
date_format_yyyymmdd = re.compile('^\d{8}$')

stn_num_format = re.compile('^\d+$')

'''
Validate a date field
'''
def valid_date(date):
    return ((date_format_dd_mm_yyyy.match(date) is not None)
        or (date_format_yyyy_mm_dd.match(date) is not None)
        or (date_format_yyyymmdd) is not None)

The function valid_date will return true if a date uses one of the
formats above. We could, however, format our regular expressions more
elegantly as follows:

date_format_yyyy_mm_dd = re.compile('^\d{4}-\d{2}-\d{2}$')
date_format_dd_mm_yyyy = re.compile('^\d{2}/\d{2}/\d{4}$')

In the code above, once we know we have a valid date, we still have to parse
the date. In my original code, I used did the following:

def normalise_date(date):
    '''
    Normalise a date - ie. given a variety of date formats
    return a date.
    '''
    if date_format_dd_mm_yyyy.match(date) is not None:
        return datetime.date(int(date[6:10]), int(date[3:5]), int(date[0:2]))

    if date_format_yyyy_mm_dd.match(date) is not None:
        return datetime.date(int(date[0:4]), int(date[5:7]), int(date[8:10])) 

    if date_format_yyyymmdd.match(date) is not None:
        return datetime(int(date[0:4]), int(date[4:6]), int(date[6:8]))

    datafile.write('ERROR in normalise date: ' + date)

The code works - but is not particularly elegant. In particular, if we add a
new date format, we also have to add extra code to

normalise_date()

Fortunately regular expressions allow us to create groups and
automatically refer to them. We do this as follows:

date_format_yyyy_mm_dd = re.compile('^(\d{4})-(\d{2})-(\d{2})$')
date_format_dd_mm_yyyy = re.compile('^(\d{2})/(\d{2})/(\d{4})$')

Now, when we do a match, we can refer to each group. So, for example, the
following code:

import re
exp = re.compile('^(\d{4})-(\d{2})-(\d{2})$')
date = '2009-12-31'
m = exp.match(date)
print m.group(1) # Year

The above code prints 2009. We can, however, make the code clearer. We can
name each group and then refer to the group by name. This brings us to our
final example.

import re

date_format_yyyy_mm_dd = re.compile(
'^(?P\d{4})-(?P\d{2})-(?P\d{2})$')
date_format_dd_mm_yyyy = re.compile(
'^(?P\d{2})/(?P\d{2})/(?P\d{4})$')
date_format_yyyymmdd = re.compile(
'^(?P\d{4})(?P\d{2})(?P\d{2})$')

birthdays = ['15/04/1707', '1777-04-30', '16430104']

date_formats = [date_format_yyyy_mm_dd, date_format_dd_mm_yyyy,
    date_format_yyyymmdd]


for d in birthdays:
    for f in date_formats:
        m = f.match(d)
        if m:
            print 'Date: %s/%s/%s' % (m.group('day'), m.group('month'),
                m.group('year'))

This will give us the following output:

Date: 15/04/1707
Date: 30/04/1777
Date: 04/01/1643

Okay, exactly what we expected. Now, how do we handle the fact that in the US
dates are written MM/DD/YYYY? Don't even dare to go there!

ps. I know at least one reader of this blog for whom the dates will be
meaningful.

Tuesday, January 25, 2011

A mathematical curiosity

Today I came across this wonderful little curiosity.

I was going to write a little Python program to try it out. Basically, we are creating two fibonacci sequences, one starting with 12, 18 and the other starting with 5, 5. At first I thought there will be nothing new here. Then I remembered that I was looking for a good example with which to demonstrate generator functions and I figured out that this would be a perfect example. Basically, we are looking for a function that each time we call it will generate the next fibonacci number in a sequence.

There is an example of such a Fibonacci function in an earlier post which I will repeat here:

def fibonacci():
    parent, baby = 1, 0

    while baby < 1000:
        print baby
        parent, baby = (parent + baby, parent)

In this case, we don't want to print the fibonacci number, but want to return the next number in the series. We can do this with a generator function. A generator function is one that has the keyword yield. yield is like a return statement, except that the function remembers where we are and then the next time we call that function, we start from where we left off (or, in other words, directly after yield). Here is our fibonacci generator:

def fibonacci(t1, t2):
    while True:
        yield t1
    t1, t2 = t2, t1 + t2

fib = fibonacci(0, 1)
for i in range(10):
    print fib.next()

Running this we get:

Now comes the interesting part - we can create as many generators as we want, so here is our final program:

def fibonacci(t1, t2):
    while True:
        yield t1
    t1, t2 = t2, t1 + t2

lhs = fibonacci(12, 18)
rhs = fibonacci(5, 5)
for i in range(30):
    l = lhs.next()
    r = rhs.next()
    print '%10d %10d     %8.7f' % (l, r, float(l) / r)

So, when I run it, I get the following results:

12           5     2.4000000
18           5     3.6000000
30          10     3.0000000
48          15     3.2000000
78          25     3.1200000
126         40     3.1500000
204         65     3.1384615
330        105     3.1428571
534        170     3.1411765
864        275     3.1418182

This looks promising and it seems to converge really quickly. So, lets run it for 20 iterations.

12            5     2.4000000
18            5     3.6000000
30           10     3.0000000
48           15     3.2000000
78           25     3.1200000
126          40     3.1500000
204          65     3.1384615
330         105     3.1428571
534         170     3.1411765
864         275     3.1418182
1398        445     3.1415730
2262        720     3.1416667
3660       1165     3.1416309
5922       1885     3.1416446
9582       3050     3.1416393
15504      4935     3.1416413
25086      7985     3.1416406
40590     12920     3.1416409
65676     20905     3.1416408
106266    33825     3.1416408

Well, that's weird. It got really close to Pi (which, as far as I can remember is something like 3.1415927), but then it seems to wander off again. Well, maybe it oscillates for a while before settling down. So, let's try 30 iterations (I will just record the last few lines of output).

728358       231840     3.1416408
1178508      375125     3.1416408
1906866      606965     3.1416408
3085374      982090     3.1416408
4992240     1589055     3.1416408
8077614     2571145     3.1416408
13069854    4160200     3.1416408

Well, it seems that we're not converging on Pi, but something very close to Pi.

Maybe I should have just stopped after 10 iterations. I marvelled at how the ratio between two fibonacci sequences would converge on Pi. Now I am a bit wiser, but also a bit sadder.

Sunday, January 9, 2011

My wife recently was doing some programming for work. The work involved creating a database that contained the names of bulls. The bulls often have interesting names like "KIRK ANDREWS FORCEFUL" or "AULDREEKIE ADDISON VACUM".

It also happens that when people are searching for a bull, they may type in the wrong name, for example "AULDREKIE ADDISON VACUUM" or type "JISSELVLIEDT 135 BREAKOUT" instead of "IJSSELVLIEDT 135 BREAKOUT".

So, given a (mistyped) bull name, how can we find out what we think the user probably meant? One way to do this is by using a metric known as Levenshtein distance (or edit distance). The edit distance between two strings is defined as the minimum number of inserts, deletes or substitutions of a single character required to transform one string into another. For example the following transforms all have a Levenshtein distance of 1:

BOY -> TOY (1 substitution)
CHAT -> HAT (1 deletion)
HAT -> CHAT (1 insertion)

While the following have a Levenshtein distance of 2:

PAPER -> TAPE (1 deletion, 1 substituion)
TAPE -> TRADE (1 insertion, 1 substitution)

Wikipedia has a very good description of the algorithm.

They also have the pseudocode which I have implemented in Python. Here it is:

'''
Write a function that calculates the Levenshtein distance between
two words.
'''

def levenshtein(first, second):
    m = len(first) + 1
    n = len(second) + 1

    # Declare an array...
    d = [[0] * n for j in range(m)]

    # Initialise the array...
    for i in range(m):
    d[i][0] = i

    for j in range(n):
        d[0][j] = j

    for i in xrange(1, m):
        for j in range(1, n):
            if first[i-1] == second[j-1]: # no operation required...
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min( d[i - 1][j], d[i][j - 1], 
                    d[i - 1][j - 1]) + 1
    return d[m-1][n-1]


def main():
    list1 = 'great grate'.split()
    list2 = 'grate rate ate'.split()

    for word1 in list1:
        for word2 in list2:
            print 'Levenshtein distance between %s and %s is %d' % (
                word1, word2, levenshtein(word1, word2))


if __name__ == '__main__':
main()

Once again, the Python code looks very similar to the pseudocode.

While there are many other types of algorithms for spelling correction or data scrubbing (e.g. Soundex and Metaphone), Levenshtein distance is often useful because it does not depend on the language or the phonics of the words. It is interested only in symbols, insertions, deletions and substitutions.

Saturday, January 8, 2011

Python in Liouville

This morning I came across the following little piece of mathematical trivia:

Choose any number (e.g. 14) and write down its divisors:

14
1, 2, 7, 14

Then write down the number of divisors of each of these divisors:

14
1, 2, 7, 14
1, 2, 2, 4

Now the square of the sum of this last group will always equal the sum of its member's cubes:

(1 + 2 + 2 + 4)² = 1³ + 2³ + 2³ + 4³

Discovered by Joseph Liouville.

Well, I learned two new things today. Some mathematical trivia and that there was a French Mathematician that I had never heard of called Liouville.

Since I am always on the lookout for simple problems that can work as Python programming exercises, I decided to use the above problem.

Here is my first attempt:

def factorise(n):
    '''
    Given a number return a list of the factors of that number.
    '''
    factors = [1]
    i = 2
    while i <= n:
        if n % i == 0:
            factors.append(i)
        i += 1
    return factors


def try_num(n):
    factors = factorise(n)
    num_factors = []
    for factor in factors:
        num_factors.append(len(factorise(factor)))

    print 'Factors: ', num_factors
    print 'Square of sum: ', sum(num_factors) * sum(num_factors)
    print 'Sum of cubes: ', sum([factor * factor * factor for factor in num_factors])


def main():
    try_num(14)
    try_num(144)
    try_num(65536)

if __name__ == '__main__':
    main()

It works, but while making some small changes, I also noticed that we can do the factorising as a single list comprehension. In other words, we can replace

factors = [1]
i = 2
while i <= n:
    if n % i == 0:
        factors.append(i)
    i += 1
    return factors

with

return [x + 1 for x in xrange(n) if n % (x + 1) == 0]

Also, we can use a list comprehension for the loop in try_num. This led to the second version of the program:

def factorise(n):
    '''
    Given a number return a list of the factors of that number.
    '''
    return [x + 1 for x in xrange(n) if n % (x + 1) == 0]

def try_num(n):
    num_factors = [len(factorise(factor)) for factor in factorise(n)]

   print 'Factors: ', num_factors
   print 'Square of sum: ', sum(num_factors) ** 2
   print 'Sum of cubes: ', sum([factor ** 3 for factor in num_factors])

def main():
    try_num(14)
    try_num(144)
    try_num(2011)
    try_num(65536)

if __name__ == '__main__':
    main()

Even after years of using Python, I am still impressed by its ability to express certain things lucidly and compact way.

Friday, January 7, 2011

Tuple packing and unpacking in Python

I was chatting to a mate who has recently started learning Python. He mentioned that one of the cool features that got him sold on Python is the fact that you can swap two values with the following idiom:


x, y = y, x

Yes, that is neat and quite predictable given that Python has automatic packing and unpacking of tuples. In other words, Python packs the values on the right hand side into a tuple and the assignment unpacks each value on the right hand side into the values on the left hand side.

Extending this, we can do something like this:


a = [3, 1, 4]
x, y, z = sorted(a)

This will give us the lowest value in x and the highest value in z.

Once again, this is a nice little hack. I was trying to think of an example where packing and unpacking really adds some value (in terms of readability or 'Pythonicness'). This morning, I accidentally came across the following
example:

Suppose we want to generate a sequence of Fibonacci numbers (anyone who has done CS101 will recall that Fibonacci Numbers are the sequence of numbers 0, 1, 1, 2, 3, 5, 8, 13... where each number is created from the sum of the previous two numbers). Fibonacci, who went by many names, was apparently trying to calculate how many rabbits he would have if he started with one parent and each parent miraculously reproduced every 6 months having exactly one baby. After 12 months that baby would become a parent. Of course these miracle bunnies never die.

The following program calculates the rabbit population until before it hits 1,000:


parent, baby = 1, 0

while baby < 1000:
    print baby
    parent, baby = (parent + baby, parent)

I think this is a wonderful example of how one of Python's language features makes code really clear and how Python earns its reputation of 'executable pseudo-code'.

Postscript: Another nice little way we can use this feature is as follows:


day, month, year = '31/12/2010'.split('/')

Thursday, January 6, 2011

Winning money with Python

Recently someone posted the following on Reddit:

http://www.reddit.com/r/math/comments/erd1r/playing_card_mathematical_question_needs_your_help/

Given a pack of playing cards, select two cards (at random) from the pack. Suppose these are a two and a ten. Replace the cards in the pack and shuffle. Now turn each card over in turn and look for one of the cards (eg. two or ten) followed by the other either as the next card or the card following the next card. You should have a better than 50% chance of that happening.

There were a few posts trying to explain why this may happen and a few posts showing code (often in Python) that gave a simulation.

The first code that I came across was the following:

import random

cards = ['Ace','King','Queen','Jack','Ten','Nine','Eight','Seven','Six','Five','Four','Three','Two']

suits = ['Clubs','Spades','Hearts','Diamonds']

deck = []

for s in suits:
    for c in cards:
        deck.append(c + ' of ' + s)

NUM_TRIALS = 100000

wins = 0

for trial in range(0,NUM_TRIALS):
    #pick two non_identical card values
    card_one = cards[random.randint(0,len(cards)-1)]
    identical = True
    while identical:
        card_two = cards[random.randint(0,len(cards)-1)]
        if card_one <> card_two:
            identical = False

#shuffle the deck:
random.shuffle(deck)
#Separated by one?
sepaRated = False
i = 0
while i < len(deck) and not separated:    
    if card_one[0:2] == deck[i][0:2]:
        if i+1 <= len(deck)-1:
            if card_two[0:2] == deck[i+1][0:2]:
                separated = True

        if i+2 <= len(deck)-1:
            if card_two[0:2] == deck[i+2][0:2]:
                separated = True                

        if card_two[0:2] == deck[i][0:2]:
            if i+1 <= len(deck)-1:
                if card_one[0:2] == deck[i+1][0:2]:
                    separated = True

     if i+2 <= len(deck)-1:
         if card_one[0:2] == deck[i+2][0:2]:
             separated = True 

    i+=1

    if separated:
        wins+=1

print float(wins)/float(NUM_TRIALS)

The second solution was this one:

sum(map(bool,[(lambda k: k &(k >> 1 | k >> 3 | (k & ((2 << 116) - 1) / 3) 
>> 5))(int('0b00' + ''.join(__import__('random').sample(['01']*4+['10']
*4+['00']*44,52)),2)<<7) for i in range(2000000)]))/2000000.0

This got me thinking about Python, readibility and 'Pythonicness'. The first solution is much closer to the problem domain - almost painfully so. The cards have been modeled as cards and the values are all strings. It took me a few seconds to parse the following statement in my head:

if card_two[0:2] == deck[i+1][0:2]

And then realised that because we are using strings for the value, we need to compare the first two characters of each string. The author of the first program claims not to have much Python experience. The second program is painfully short. It would appear that the author has going to pains to fit it all onto one line. It may appear that the author has too much Python experience (although I realise s/he was just trying to make a point). Anyway, I thought it was a nice problem and so tried to come up with a Goldilocks solution... not too much experience, not too little experience. Not too much in the problem space and not too much in the solution space. I came up with the following solution:

'''
Given a deck of cards, select two cards at random
and run a simulation to determine how often those
two cards appear within two cards of each other.
'''

import random

'''
Since we are not concerned with the suit, we shall 
represent the deck as 4 lists of the cards from 1..13.
'''
cards = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] * 4

trials = 1000 # number of simulations

def check_sequence(deck, c1, c2):
    '''
    Given a deck of cards and two single cards, return True if the
    two cards appear within 2 cards of each other in the deck, else
    return False.
    '''
    for i in range(len(deck) - 2):
        if deck[i] == c1 and (deck[i+1] == c2 or deck[i+2] == c2):
    return True

    if deck[i] == c2 and (deck[i+1] == c1 or deck[i+2] == c1):
        return True

    return False


def run_simulation():
    '''
    Run a single event.
    Without loss of generality, we may assume that the two
    random cards we selected were 1 and 2.
    '''
    random.shuffle(cards)
    return 1 if check_sequence(cards, 1, 2) else 0


def main():
    successes = 0 # number of trials where we find the two cards within 1

    for i in xrange(trials):
        successes += run_simulation()

    print float(successes) / trials

if __name__ == '__main__':
    main()

Running this version gave me similar results to the previous two versions (about 0.73 - or, in other words, 73% of the time, we would expect to find two random cards within 1 card of each other). I was pleasantly surprised to find that my code was the most efficient of the lot (i.e. ran in the shortest time). I suspect that this was because I used

xrange()

rather than

range()

.

In order to try to monetarise my code, I offered a friend a $5.00 bet. I lost. I took another bet and lost again.

I guess there is a moral there somewhere, but I am not exactly sure what it is.

1	2	3	4
1	2	4	3
1	3	2	4
1	3	4	2
1	4	2	3
1	4	3	2
2	1	3	4
2	1	4	3	*
2	3	1	4
2	3	4	1	*
2	4	1	3	*
2	4	3	1
3	2	1	4
3	2	4	1
3	1	4	2	*
3	1	2	4
3	4	1	2	*
3	4	2	1	*
4	1	2	3	*
4	1	3	2
4	2	1	3
4	2	3	1
4	3	1	2	*
4	3	2	1	*

Original:	1	2	3	4	5
After the swap:	4	2	3	1	5

1	2	3	4
1	2	4	3
1	3	2	4
1	3	4	2
1	4	2	3
1	4	3	2
2	1	3	4
2	1	4	3	*
2	3	1	4
2	3	4	1	*
2	4	1	3	*
2	4	3	1
3	2	1	4
3	2	4	1
3	1	4	2	*
3	1	2	4
3	4	1	2	*
3	4	2	1	*
4	1	2	3	*
4	1	3	2
4	2	1	3
4	2	3	1
4	3	1	2	*
4	3	2	1	*

1	2	3	4
1	2	4	3
1	3	2	4
1	3	4	2
1	4	2	3
1	4	3	2
2	1	3	4
2	1	4	3	*
2	3	1	4
2	3	4	1	*
2	4	1	3	*
2	4	3	1
3	2	1	4
3	2	4	1
3	1	4	2	*
3	1	2	4
3	4	1	2	*
3	4	2	1	*
4	1	2	3	*
4	1	3	2
4	2	1	3
4	2	3	1
4	3	1	2	*
4	3	2	1	*