Sunday, January 9, 2011

My wife recently was doing some programming for work. The work involved creating a database that contained the names of bulls. The bulls often have interesting names like "KIRK ANDREWS FORCEFUL" or "AULDREEKIE ADDISON VACUM".

It also happens that when people are searching for a bull, they may type in the wrong name, for example "AULDREKIE ADDISON VACUUM" or type "JISSELVLIEDT 135 BREAKOUT" instead of "IJSSELVLIEDT 135 BREAKOUT".

So, given a (mistyped) bull name, how can we find out what we think the user probably meant? One way to do this is by using a metric known as Levenshtein distance (or edit distance). The edit distance between two strings is defined as the minimum number of inserts, deletes or substitutions of a single character required to transform one string into another. For example the following transforms all have a Levenshtein distance of 1:

BOY -> TOY (1 substitution)
CHAT -> HAT (1 deletion)
HAT -> CHAT (1 insertion)

While the following have a Levenshtein distance of 2:

PAPER -> TAPE (1 deletion, 1 substituion)
TAPE -> TRADE (1 insertion, 1 substitution)

Wikipedia has a very good description of the algorithm.

They also have the pseudocode which I have implemented in Python. Here it is:

Write a function that calculates the Levenshtein distance between
two words.

def levenshtein(first, second):
    m = len(first) + 1
    n = len(second) + 1

    # Declare an array...
    d = [[0] * n for j in range(m)]

    # Initialise the array...
    for i in range(m):
    d[i][0] = i

    for j in range(n):
        d[0][j] = j

    for i in xrange(1, m):
        for j in range(1, n):
            if first[i-1] == second[j-1]: # no operation required...
                d[i][j] = d[i - 1][j - 1]
                d[i][j] = min( d[i - 1][j], d[i][j - 1], 
                    d[i - 1][j - 1]) + 1
    return d[m-1][n-1]

def main():
    list1 = 'great grate'.split()
    list2 = 'grate rate ate'.split()

    for word1 in list1:
        for word2 in list2:
            print 'Levenshtein distance between %s and %s is %d' % (
                word1, word2, levenshtein(word1, word2))

if __name__ == '__main__':

Once again, the Python code looks very similar to the pseudocode.

While there are many other types of algorithms for spelling correction or data scrubbing (e.g. Soundex and Metaphone), Levenshtein distance is often useful because it does not depend on the language or the phonics of the words. It is interested only in symbols, insertions, deletions and substitutions.

No comments:

Post a Comment