Wednesday, January 26, 2011

Handling dates with regular expressions

I have a confession to make. Part of the reason I have created this blog is as a memo to myself for little features that I have come across while using Python. I find myself being faced again and again with the same sort of problem and then I cannot remember in which script I last used it, so I go looking through all my scripts. There is another little confession; I used to lecture and enjoyed it. I am hoping that occasionally someone interested in learning Python will stumble across this blog and find something useful in it. Maybe even hoping that someone not interested in Python will stumble across this blog and become interested in Python.

Anyway, this morning I came across a request from a colleague who wanted to be able to ingest dates - but dates come in different formats, eg. 15/04/1707 and 1777-04-30 and even 16430104.

One way to treat these dates would be to split them using the split function and then try to work out which part of the date is the year, the month and the day. Another method is to use regular expressions - and that is the topic of this post.

I had some simple code which I developed a while back to attack exactly this type of problem. This is the code:

import re
date_format_yyyy_mm_dd = re.compile('^\d\d\d\d-\d\d-\d\d$')
date_format_dd_mm_yyyy = re.compile('^\d\d/\d\d/\d\d\d\d$')
date_format_yyyymmdd = re.compile('^\d{8}$')

stn_num_format = re.compile('^\d+$')

'''
Validate a date field
'''
def valid_date(date):
    return ((date_format_dd_mm_yyyy.match(date) is not None)
        or (date_format_yyyy_mm_dd.match(date) is not None)
        or (date_format_yyyymmdd) is not None)


The function valid_date will return true if a date uses one of the
formats above. We could, however, format our regular expressions more
elegantly as follows:

date_format_yyyy_mm_dd = re.compile('^\d{4}-\d{2}-\d{2}$')
date_format_dd_mm_yyyy = re.compile('^\d{2}/\d{2}/\d{4}$')


In the code above, once we know we have a valid date, we still have to parse
the date. In my original code, I used did the following:

def normalise_date(date):
    '''
    Normalise a date - ie. given a variety of date formats
    return a date.
    '''
    if date_format_dd_mm_yyyy.match(date) is not None:
        return datetime.date(int(date[6:10]), int(date[3:5]), int(date[0:2]))

    if date_format_yyyy_mm_dd.match(date) is not None:
        return datetime.date(int(date[0:4]), int(date[5:7]), int(date[8:10])) 

    if date_format_yyyymmdd.match(date) is not None:
        return datetime(int(date[0:4]), int(date[4:6]), int(date[6:8]))

    datafile.write('ERROR in normalise date: ' + date)


The code works - but is not particularly elegant. In particular, if we add a
new date format, we also have to add extra code to
normalise_date()


Fortunately regular expressions allow us to create groups and
automatically refer to them. We do this as follows:


date_format_yyyy_mm_dd = re.compile('^(\d{4})-(\d{2})-(\d{2})$')
date_format_dd_mm_yyyy = re.compile('^(\d{2})/(\d{2})/(\d{4})$')


Now, when we do a match, we can refer to each group. So, for example, the
following code:

import re
exp = re.compile('^(\d{4})-(\d{2})-(\d{2})$')
date = '2009-12-31'
m = exp.match(date)
print m.group(1) # Year


The above code prints 2009. We can, however, make the code clearer. We can
name each group and then refer to the group by name. This brings us to our
final example.

import re

date_format_yyyy_mm_dd = re.compile(
'^(?P\d{4})-(?P\d{2})-(?P\d{2})$')
date_format_dd_mm_yyyy = re.compile(
'^(?P\d{2})/(?P\d{2})/(?P\d{4})$')
date_format_yyyymmdd = re.compile(
'^(?P\d{4})(?P\d{2})(?P\d{2})$')

birthdays = ['15/04/1707', '1777-04-30', '16430104']

date_formats = [date_format_yyyy_mm_dd, date_format_dd_mm_yyyy,
    date_format_yyyymmdd]


for d in birthdays:
    for f in date_formats:
        m = f.match(d)
        if m:
            print 'Date: %s/%s/%s' % (m.group('day'), m.group('month'),
                m.group('year'))




This will give us the following output:

Date: 15/04/1707
Date: 30/04/1777
Date: 04/01/1643


Okay, exactly what we expected. Now, how do we handle the fact that in the US
dates are written MM/DD/YYYY? Don't even dare to go there!

ps. I know at least one reader of this blog for whom the dates will be
meaningful.

Tuesday, January 25, 2011

A mathematical curiosity

Today I came across this wonderful little curiosity.

I was going to write a little Python program to try it out. Basically, we are creating two fibonacci sequences, one starting with 12, 18 and the other starting with 5, 5. At first I thought there will be nothing new here. Then I remembered that I was looking for a good example with which to demonstrate generator functions and I figured out that this would be a perfect example. Basically, we are looking for a function that each time we call it will generate the next fibonacci number in a sequence.

There is an example of such a Fibonacci function in an earlier post which I will repeat here:

def fibonacci():
    parent, baby = 1, 0

    while baby < 1000:
        print baby
        parent, baby = (parent + baby, parent)
In this case, we don't want to print the fibonacci number, but want to return the next number in the series. We can do this with a generator function. A generator function is one that has the keyword yield. yield is like a return statement, except that the function remembers where we are and then the next time we call that function, we start from where we left off (or, in other words, directly after yield). Here is our fibonacci generator:
def fibonacci(t1, t2):
    while True:
        yield t1
    t1, t2 = t2, t1 + t2

fib = fibonacci(0, 1)
for i in range(10):
    print fib.next()

Running this we get:
0
1
1
2
3
5
8
13
21
34
Now comes the interesting part - we can create as many generators as we want, so here is our final program:
def fibonacci(t1, t2):
    while True:
        yield t1
    t1, t2 = t2, t1 + t2

lhs = fibonacci(12, 18)
rhs = fibonacci(5, 5)
for i in range(30):
    l = lhs.next()
    r = rhs.next()
    print '%10d %10d     %8.7f' % (l, r, float(l) / r)
So, when I run it, I get the following results:
12           5     2.4000000
18           5     3.6000000
30          10     3.0000000
48          15     3.2000000
78          25     3.1200000
126         40     3.1500000
204         65     3.1384615
330        105     3.1428571
534        170     3.1411765
864        275     3.1418182
This looks promising and it seems to converge really quickly. So, lets run it for 20 iterations.
12            5     2.4000000
18            5     3.6000000
30           10     3.0000000
48           15     3.2000000
78           25     3.1200000
126          40     3.1500000
204          65     3.1384615
330         105     3.1428571
534         170     3.1411765
864         275     3.1418182
1398        445     3.1415730
2262        720     3.1416667
3660       1165     3.1416309
5922       1885     3.1416446
9582       3050     3.1416393
15504      4935     3.1416413
25086      7985     3.1416406
40590     12920     3.1416409
65676     20905     3.1416408
106266    33825     3.1416408
Well, that's weird. It got really close to Pi (which, as far as I can remember is something like 3.1415927), but then it seems to wander off again. Well, maybe it oscillates for a while before settling down. So, let's try 30 iterations (I will just record the last few lines of output).
728358       231840     3.1416408
1178508      375125     3.1416408
1906866      606965     3.1416408
3085374      982090     3.1416408
4992240     1589055     3.1416408
8077614     2571145     3.1416408
13069854    4160200     3.1416408


Well, it seems that we're not converging on Pi, but something very close to Pi.

Maybe I should have just stopped after 10 iterations. I marvelled at how the ratio between two fibonacci sequences would converge on Pi. Now I am a bit wiser, but also a bit sadder.

Sunday, January 9, 2011

My wife recently was doing some programming for work. The work involved creating a database that contained the names of bulls. The bulls often have interesting names like "KIRK ANDREWS FORCEFUL" or "AULDREEKIE ADDISON VACUM".

It also happens that when people are searching for a bull, they may type in the wrong name, for example "AULDREKIE ADDISON VACUUM" or type "JISSELVLIEDT 135 BREAKOUT" instead of "IJSSELVLIEDT 135 BREAKOUT".

So, given a (mistyped) bull name, how can we find out what we think the user probably meant? One way to do this is by using a metric known as Levenshtein distance (or edit distance). The edit distance between two strings is defined as the minimum number of inserts, deletes or substitutions of a single character required to transform one string into another. For example the following transforms all have a Levenshtein distance of 1:

BOY -> TOY (1 substitution)
CHAT -> HAT (1 deletion)
HAT -> CHAT (1 insertion)


While the following have a Levenshtein distance of 2:

PAPER -> TAPE (1 deletion, 1 substituion)
TAPE -> TRADE (1 insertion, 1 substitution)


Wikipedia has a very good description of the algorithm.

They also have the pseudocode which I have implemented in Python. Here it is:

'''
Write a function that calculates the Levenshtein distance between
two words.
'''

def levenshtein(first, second):
    m = len(first) + 1
    n = len(second) + 1

    # Declare an array...
    d = [[0] * n for j in range(m)]

    # Initialise the array...
    for i in range(m):
    d[i][0] = i

    for j in range(n):
        d[0][j] = j

    for i in xrange(1, m):
        for j in range(1, n):
            if first[i-1] == second[j-1]: # no operation required...
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min( d[i - 1][j], d[i][j - 1], 
                    d[i - 1][j - 1]) + 1
    return d[m-1][n-1]


def main():
    list1 = 'great grate'.split()
    list2 = 'grate rate ate'.split()

    for word1 in list1:
        for word2 in list2:
            print 'Levenshtein distance between %s and %s is %d' % (
                word1, word2, levenshtein(word1, word2))


if __name__ == '__main__':
main()




Once again, the Python code looks very similar to the pseudocode.

While there are many other types of algorithms for spelling correction or data scrubbing (e.g. Soundex and Metaphone), Levenshtein distance is often useful because it does not depend on the language or the phonics of the words. It is interested only in symbols, insertions, deletions and substitutions.

Saturday, January 8, 2011

Python in Liouville

This morning I came across the following little piece of mathematical trivia:

Choose any number (e.g. 14) and write down its divisors:

14
1, 2, 7, 14

Then write down the number of divisors of each of these divisors:
14
1, 2, 7, 14
1, 2, 2, 4

Now the square of the sum of this last group will always equal the sum of its member's cubes:
(1 + 2 + 2 + 4)2 = 13 + 23 + 23 + 43

Discovered by Joseph Liouville.

Well, I learned two new things today. Some mathematical trivia and that there was a French Mathematician that I had never heard of called Liouville.

Since I am always on the lookout for simple problems that can work as Python programming exercises, I decided to use the above problem.

Here is my first attempt:

def factorise(n):
    '''
    Given a number return a list of the factors of that number.
    '''
    factors = [1]
    i = 2
    while i <= n:
        if n % i == 0:
            factors.append(i)
        i += 1
    return factors


def try_num(n):
    factors = factorise(n)
    num_factors = []
    for factor in factors:
        num_factors.append(len(factorise(factor)))

    print 'Factors: ', num_factors
    print 'Square of sum: ', sum(num_factors) * sum(num_factors)
    print 'Sum of cubes: ', sum([factor * factor * factor for factor in num_factors])


def main():
    try_num(14)
    try_num(144)
    try_num(65536)

if __name__ == '__main__':
    main()
It works, but while making some small changes, I also noticed that we can do the factorising as a single list comprehension. In other words, we can replace
factors = [1]
i = 2
while i <= n:
    if n % i == 0:
        factors.append(i)
    i += 1
    return factors
with
return [x + 1 for x in xrange(n) if n % (x + 1) == 0]
Also, we can use a list comprehension for the loop in try_num. This led to the second version of the program:
def factorise(n):
    '''
    Given a number return a list of the factors of that number.
    '''
    return [x + 1 for x in xrange(n) if n % (x + 1) == 0]

def try_num(n):
    num_factors = [len(factorise(factor)) for factor in factorise(n)]

   print 'Factors: ', num_factors
   print 'Square of sum: ', sum(num_factors) ** 2
   print 'Sum of cubes: ', sum([factor ** 3 for factor in num_factors])

def main():
    try_num(14)
    try_num(144)
    try_num(2011)
    try_num(65536)

if __name__ == '__main__':
    main()


Even after years of using Python, I am still impressed by its ability to express certain things lucidly and compact way.

Friday, January 7, 2011

Tuple packing and unpacking in Python

I was chatting to a mate who has recently started learning Python. He mentioned that one of the cool features that got him sold on Python is the fact that you can swap two values with the following idiom:


x, y = y, x


Yes, that is neat and quite predictable given that Python has automatic packing and unpacking of tuples. In other words, Python packs the values on the right hand side into a tuple and the assignment unpacks each value on the right hand side into the values on the left hand side.

Extending this, we can do something like this:


a = [3, 1, 4]
x, y, z = sorted(a)


This will give us the lowest value in x and the highest value in z.

Once again, this is a nice little hack. I was trying to think of an example where packing and unpacking really adds some value (in terms of readability or 'Pythonicness'). This morning, I accidentally came across the following
example:

Suppose we want to generate a sequence of Fibonacci numbers (anyone who has done CS101 will recall that Fibonacci Numbers are the sequence of numbers 0, 1, 1, 2, 3, 5, 8, 13... where each number is created from the sum of the previous two numbers). Fibonacci, who went by many names, was apparently trying to calculate how many rabbits he would have if he started with one parent and each parent miraculously reproduced every 6 months having exactly one baby. After 12 months that baby would become a parent. Of course these miracle bunnies never die.

The following program calculates the rabbit population until before it hits 1,000:


parent, baby = 1, 0

while baby < 1000:
print baby
parent, baby = (parent + baby, parent)


I think this is a wonderful example of how one of Python's language features makes code really clear and how Python earns its reputation of 'executable pseudo-code'.

Postscript: Another nice little way we can use this feature is as follows:

day, month, year = '31/12/2010'.split('/')

Thursday, January 6, 2011

Winning money with Python

Recently someone posted the following on Reddit:

http://www.reddit.com/r/math/comments/erd1r/playing_card_mathematical_question_needs_your_help/

Given a pack of playing cards, select two cards (at random) from the pack. Suppose these are a two and a ten. Replace the cards in the pack and shuffle. Now turn each card over in turn and look for one of the cards (eg. two or ten) followed by the other either as the next card or the card following the next card. You should have a better than 50% chance of that happening.

There were a few posts trying to explain why this may happen and a few posts showing code (often in Python) that gave a simulation.

The first code that I came across was the following:

import random

cards = ['Ace','King','Queen','Jack','Ten','Nine','Eight','Seven','Six','Five','Four','Three','Two']

suits = ['Clubs','Spades','Hearts','Diamonds']

deck = []

for s in suits:
    for c in cards:
        deck.append(c + ' of ' + s)

NUM_TRIALS = 100000

wins = 0

for trial in range(0,NUM_TRIALS):
    #pick two non_identical card values
    card_one = cards[random.randint(0,len(cards)-1)]
    identical = True
    while identical:
        card_two = cards[random.randint(0,len(cards)-1)]
        if card_one <> card_two:
            identical = False

#shuffle the deck:
random.shuffle(deck)
#Separated by one?
sepaRated = False
i = 0
while i < len(deck) and not separated:    
    if card_one[0:2] == deck[i][0:2]:
        if i+1 <= len(deck)-1:
            if card_two[0:2] == deck[i+1][0:2]:
                separated = True

        if i+2 <= len(deck)-1:
            if card_two[0:2] == deck[i+2][0:2]:
                separated = True                

        if card_two[0:2] == deck[i][0:2]:
            if i+1 <= len(deck)-1:
                if card_one[0:2] == deck[i+1][0:2]:
                    separated = True

     if i+2 <= len(deck)-1:
         if card_one[0:2] == deck[i+2][0:2]:
             separated = True 

    i+=1

    if separated:
        wins+=1

print float(wins)/float(NUM_TRIALS)
The second solution was this one:
sum(map(bool,[(lambda k: k &(k >> 1 | k >> 3 | (k & ((2 << 116) - 1) / 3) 
>> 5))(int('0b00' + ''.join(__import__('random').sample(['01']*4+['10']
*4+['00']*44,52)),2)<<7) for i in range(2000000)]))/2000000.0  
This got me thinking about Python, readibility and 'Pythonicness'. The first solution is much closer to the problem domain - almost painfully so. The cards have been modeled as cards and the values are all strings. It took me a few seconds to parse the following statement in my head:
if card_two[0:2] == deck[i+1][0:2]
And then realised that because we are using strings for the value, we need to compare the first two characters of each string. The author of the first program claims not to have much Python experience. The second program is painfully short. It would appear that the author has going to pains to fit it all onto one line. It may appear that the author has too much Python experience (although I realise s/he was just trying to make a point). Anyway, I thought it was a nice problem and so tried to come up with a Goldilocks solution... not too much experience, not too little experience. Not too much in the problem space and not too much in the solution space. I came up with the following solution:
'''
Given a deck of cards, select two cards at random
and run a simulation to determine how often those
two cards appear within two cards of each other.
'''

import random

'''
Since we are not concerned with the suit, we shall 
represent the deck as 4 lists of the cards from 1..13.
'''
cards = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] * 4

trials = 1000 # number of simulations

def check_sequence(deck, c1, c2):
    '''
    Given a deck of cards and two single cards, return True if the
    two cards appear within 2 cards of each other in the deck, else
    return False.
    '''
    for i in range(len(deck) - 2):
        if deck[i] == c1 and (deck[i+1] == c2 or deck[i+2] == c2):
    return True

    if deck[i] == c2 and (deck[i+1] == c1 or deck[i+2] == c1):
        return True

    return False


def run_simulation():
    '''
    Run a single event.
    Without loss of generality, we may assume that the two
    random cards we selected were 1 and 2.
    '''
    random.shuffle(cards)
    return 1 if check_sequence(cards, 1, 2) else 0


def main():
    successes = 0 # number of trials where we find the two cards within 1

    for i in xrange(trials):
        successes += run_simulation()

    print float(successes) / trials

if __name__ == '__main__':
    main()
Running this version gave me similar results to the previous two versions (about 0.73 - or, in other words, 73% of the time, we would expect to find two random cards within 1 card of each other). I was pleasantly surprised to find that my code was the most efficient of the lot (i.e. ran in the shortest time). I suspect that this was because I used
xrange() 
rather than
range()
.

In order to try to monetarise my code, I offered a friend a $5.00 bet. I lost. I took another bet and lost again.

I guess there is a moral there somewhere, but I am not exactly sure what it is.