Wednesday, January 26, 2011

Handling dates with regular expressions

I have a confession to make. Part of the reason I have created this blog is as a memo to myself for little features that I have come across while using Python. I find myself being faced again and again with the same sort of problem and then I cannot remember in which script I last used it, so I go looking through all my scripts. There is another little confession; I used to lecture and enjoyed it. I am hoping that occasionally someone interested in learning Python will stumble across this blog and find something useful in it. Maybe even hoping that someone not interested in Python will stumble across this blog and become interested in Python.

Anyway, this morning I came across a request from a colleague who wanted to be able to ingest dates, but dates come in different formats, e.g. 15/04/1707, 1777-04-30 and even 16430104.

One way to treat these dates would be to split them using the split function and then try to work out which part of the date is the year, the month and the day. Another method is to use regular expressions - and that is the topic of this post.
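
To give a feel for the split approach, here is a rough sketch. The helper name split_date and the rule "a four-digit first field means the year comes first" are my own assumptions for illustration, not code from the colleague's request.

```python
def split_date(text):
    """Return (year, month, day) as ints, or None if the layout is not recognised."""
    for sep in ('/', '-'):
        parts = text.split(sep)
        if len(parts) == 3 and all(p.isdigit() for p in parts):
            if len(parts[0]) == 4:        # 1777-04-30 style: year first
                year, month, day = parts
            elif len(parts[2]) == 4:      # 15/04/1707 style: year last
                day, month, year = parts
            else:
                return None
            return int(year), int(month), int(day)
    if len(text) == 8 and text.isdigit():  # 16430104 style: no separator
        return int(text[0:4]), int(text[4:6]), int(text[6:8])
    return None


print(split_date('15/04/1707'))
print(split_date('1777-04-30'))
print(split_date('16430104'))
```

Even this small version needs a branch for every layout, and it quietly accepts mixtures such as 1777/04/30, which is part of why I prefer regular expressions here.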

I had some simple code which I developed a while back to attack exactly this type of problem. This is the code:

import re
# Raw strings (r'...') keep the backslashes in \d intact for the regex engine
date_format_yyyy_mm_dd = re.compile(r'^\d\d\d\d-\d\d-\d\d$')
date_format_dd_mm_yyyy = re.compile(r'^\d\d/\d\d/\d\d\d\d$')
date_format_yyyymmdd = re.compile(r'^\d{8}$')

stn_num_format = re.compile(r'^\d+$')

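def valid_date(date):
    '''
    Validate a date field
    '''
    # Each pattern must actually be matched against the date; a compiled
    # pattern on its own is never None, so the test would always pass.
    return ((date_format_dd_mm_yyyy.match(date) is not None)
        or (date_format_yyyy_mm_dd.match(date) is not None)
        or (date_format_yyyymmdd.match(date) is not None))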


The function valid_date will return True if a date uses one of the
formats above. We could, however, write the first two regular expressions
more compactly using repetition counts, as follows:

date_format_yyyy_mm_dd = re.compile(r'^\d{4}-\d{2}-\d{2}$')
date_format_dd_mm_yyyy = re.compile(r'^\d{2}/\d{2}/\d{4}$')
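
If you want to convince yourself that the two spellings behave the same, a quick check along these lines should do. The names long_form, short_form and samples are just ones I made up for this sketch.

```python
import re

long_form = re.compile(r'^\d\d\d\d-\d\d-\d\d$')
short_form = re.compile(r'^\d{4}-\d{2}-\d{2}$')

# One good date and three near misses (short year, wrong separator, extra digit)
samples = ['1777-04-30', '177-04-30', '1777/04/30', '1777-04-300']

# For each sample, record whether the compact pattern accepts it
results = [(s, short_form.match(s) is not None) for s in samples]

# Check that the long-hand pattern gives the same verdict on every sample
agree = all((long_form.match(s) is not None) == ok for s, ok in results)

print(results)
print(agree)
```

Only the first sample is accepted, and the two patterns agree on all four.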


In the code above, once we know we have a valid date, we still have to parse
the date. In my original code, I did the following:

import datetime

def normalise_date(date):
    '''
    Normalise a date - ie. given a variety of date formats
    return a date.
    '''
    # dd/mm/yyyy: the year is at positions 6-9, the month at 3-4, the day at 0-1
    if date_format_dd_mm_yyyy.match(date) is not None:
        return datetime.date(int(date[6:10]), int(date[3:5]), int(date[0:2]))

    # yyyy-mm-dd
    if date_format_yyyy_mm_dd.match(date) is not None:
        return datetime.date(int(date[0:4]), int(date[5:7]), int(date[8:10]))

    # yyyymmdd
    if date_format_yyyymmdd.match(date) is not None:
        return datetime.date(int(date[0:4]), int(date[4:6]), int(date[6:8]))

    # datafile: an already-open log file from the original script (not shown here)
    datafile.write('ERROR in normalise date: ' + date)


The code works, but it is not particularly elegant. In particular, if we add a
new date format, we also have to add extra code to normalise_date(), with a
fresh set of hand-counted string offsets.


Fortunately regular expressions allow us to create groups and
automatically refer to them. We do this as follows:


date_format_yyyy_mm_dd = re.compile(r'^(\d{4})-(\d{2})-(\d{2})$')
date_format_dd_mm_yyyy = re.compile(r'^(\d{2})/(\d{2})/(\d{4})$')


Now, when we do a match, we can refer to each group by its position, counting
from 1. So, for example, consider the following code:

import re
exp = re.compile(r'^(\d{4})-(\d{2})-(\d{2})$')
date = '2009-12-31'
m = exp.match(date)
print(m.group(1)) # Year


The above code prints 2009. We can, however, make the code clearer. We can
name each group and then refer to the group by name. This brings us to our
final example.

import re

date_format_yyyy_mm_dd = re.compile(
    r'^(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$')
date_format_dd_mm_yyyy = re.compile(
    r'^(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})$')
date_format_yyyymmdd = re.compile(
    r'^(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})$')

birthdays = ['15/04/1707', '1777-04-30', '16430104']

date_formats = [date_format_yyyy_mm_dd, date_format_dd_mm_yyyy,
    date_format_yyyymmdd]


for d in birthdays:
    for f in date_formats:
        m = f.match(d)
        if m:
            print('Date: %s/%s/%s' % (m.group('day'), m.group('month'),
                m.group('year')))




This will give us the following output:

Date: 15/04/1707
Date: 30/04/1777
Date: 04/01/1643
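
Named groups also take care of my earlier complaint about normalise_date(). The sketch below is one way it might look: the list name DATE_FORMATS is mine, and returning None for an unrecognised date (rather than writing to a log file as my original did) is an assumption made to keep the example self-contained.

```python
import datetime
import re

# Assumed to be the same three patterns as in the final example above.
DATE_FORMATS = [
    re.compile(r'^(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$'),
    re.compile(r'^(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})$'),
    re.compile(r'^(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})$'),
]


def normalise_date(text):
    """Return a datetime.date for the first pattern that matches, else None."""
    for pattern in DATE_FORMATS:
        m = pattern.match(text)
        if m:
            # Raises ValueError for impossible dates such as 31/02/2009.
            return datetime.date(int(m.group('year')),
                                 int(m.group('month')),
                                 int(m.group('day')))
    return None


for d in ['15/04/1707', '1777-04-30', '16430104']:
    print(normalise_date(d))
```

Supporting a new format now means adding one pattern to the list; the function itself does not change and there are no string offsets to count.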


Okay, exactly what we expected. Now, how do we handle the fact that in the US
dates are written MM/DD/YYYY? Don't even dare to go there!
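
For anyone who does dare, here is a small sketch of why the pattern alone cannot settle it. The function name possible_readings is hypothetical; it simply tries both orderings and keeps whichever ones form a real calendar date.

```python
import datetime


def possible_readings(text):
    """Return every valid date that an NN/NN/YYYY string could mean."""
    first, second, year = (int(p) for p in text.split('/'))
    readings = set()
    for day, month in ((first, second), (second, first)):
        try:
            readings.add(datetime.date(year, month, day))
        except ValueError:
            pass  # e.g. month 15 does not exist, so this reading is ruled out
    return sorted(readings)


print(possible_readings('15/04/1707'))
print(possible_readings('04/01/1643'))
```

The first string has only one sensible reading, but the second could be either 4 January or 1 April, and no regular expression can tell you which one the writer meant.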

PS. I know at least one reader of this blog for whom the dates will be
meaningful.
