Monday, April 11, 2011

Miller-Rabin to the rescue

I love words and I love primes, so when I came across this article on prime words, naturally I was tempted to try to implement it in Python.

In case you can't click through to the article, here is what we want to do. In base 36 every letter of the alphabet represents a digit. So every word is a number. The digit A has the value 11, B has the value 12 and D has the value 14, so BAD would represent 12*(36^2) + 11*(36^1) + 14*(36^0).

This value may or may not be prime. Our program will read all the words in a dictionary and print out the words which have a prime value.

Here is my first (naive) attempt:

#
# Read a dictionary and print out all words which are prime base 36.
#

import math

def isprime(num):
    '''
    Return true if a number is prime, else false.
    '''

    # most of the numbers will be tested by the next four lines
    if num == 2 or num == 3 or num == 5:
        return True
    if num % 2 == 0 or num % 3 == 0 or num % 5 == 0:
        return False
    
    # For the remainder...
    n = 5
    while (n <= math.sqrt(num)):
        if num % n == 0:
            return False
        n += 2
    return True


for word in open('/usr/share/dict/words'):
    if isprime(int(word.rstrip().upper(), 36)):
        print(word)

So, what is wrong with it? Well, I started running it and I started to see a few words, for example:
Ababdeh
abacist
abandon
But then things slowed down. As soon as I thought about it for a few seconds, I realised that some words probably have quite high integer values. For example, let's take ankyloblepharon (the condition where your eyelids are stuck together). Let's convert this to an integer base 64:
>>> int('ankyloblepharon', 36)
65432123906951596638119L
Okay, that's a pretty big number. Fortunately we only test half the numbers in the range. That's still pretty big. Even more fortunate, we only test up the the square root of the number which gives us:
>>>  math.sqrt(int('ankyloblepharon', 36) /2)
180875819150.80798
Adding a few commas, we get: 180,875,819,150 or to put it more simply, 180 billion tests. Clearly we need something more efficient. Luckily there is a test for primality called Miller-Rabin which is a "probabilistic" test. We test a number to see if it's prime. If the test tells us the number is composite, we are sure it is composite. If the test tells us the number is prime, we are pretty sure it is prime. Depending on the parameter we use, pretty sure can mean something like 99.999999999999% sure. I used the Wikipedia definition of Miller-Rabin. Here is the revised code:
#
# Read a dictionary and print out all words which are prime base 36.
#

import math


#
# Implementation of Miller-Rabin based on Wikipedia algorithm
#
# Algorithm:
# Input: n > 3, an odd integer to be tested for primality;
# Input: k, a parameter that determines the accuracy of the test
# Output: composite if n is composite, otherwise probably prime
# write n - 1 as 2^s*d with d odd by factoring powers of 2 from n - 1
# LOOP: repeat k times:
#   pick a random integer a in the range [2, n - 2]
#   x <- a^d mod n
#   if x = 1 or x = n - 1 then do next LOOP
#   for r = 1 .. s - 1
#      x <- x2 mod n
#      if x = 1 then return composite
#      if x = n - 1 then do next LOOP
#   return composite
#return probably prime
# 
#
import random

def inner_test(n, s, x):
    for r in range(1, s):
        x = pow(x, 2, n)
        if x == 1:
            return False
        if x == n - 1:
            return True 
    return False 


def miller_rabin(n):
    if n < 2:
        return False

    if n % 2 == 0:
        return n == 2

    k = 30 # Number of tests
    s = 0 
    d = n - 1
    while d % 2 == 0:
        s += 1
        d = d / 2
    for i in range(k):
        a = random.randint(2, n - 1)
        x = pow(a, d, n) 
       
        if x == 1 or x == n - 1:
            continue
        
        if not inner_test(n, s, x):
            return False
    return True



for word in open('/usr/share/dict/words'):
    if miller_rabin(int(word.rstrip().upper(), 36)):
        print(word)


So, how much better was the revised code? Well, in less time than it took to print the first four words using the first test, the revised test found 12,000+ words.

Interestingly, there is also a deterministic version of Miller-Rabin. According to the Wikipedia article, we can test numbers up to 341,550,071,728,321 using only the following numbers: 2, 3, 5, 7, 11, 13 and 17. I have a feeling that limit should cover most of the words in our dictionary, but that is a task for another day. I think a speed up of about 4,000 times is sufficient for one day.