tag:blogger.com,1999:blog-83428514164234053492015-09-16T17:00:49.499-07:00mudeltasigmamudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.comBlogger19125tag:blogger.com,1999:blog-8342851416423405349.post-37589865522705350802012-04-27T16:32:00.001-07:002012-04-27T16:42:52.159-07:00Cracking HTML with Unix<p><style>pre {font-family: monospace, "Andale Mono", "Lucida Console", "Monaco", fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%} </style></p> <p>Our company uses the enterprise social networking site Yammer. For those who may not be familiar with it, it is a sort of internal Facebook.</p> <p>Somebody had created a graph of how membership had increased over the months, but the person had left the company. People were interested to see how the membership had increased in the last year. I decided to look at the problem.</p> <p>Yammer has a members page and the page shows the date the person joined.</p> <p>If we look at the source of the page, it looks something like this:</p> <pre><code><td class="identity"><br /> <div class="info"><br /> <img alt="Status_on" ... /><br /> <a href="???" class="primary yj-hovercard-link" title="view profile">???&/a><br /> </div><br /></td><br /><br /><td class="joined_at"><br /> Nov 24, 2010<br /></td><br /></code></pre> <p>So, I saved the source for all the members in a file called <tt>members.txt</tt>. This file had about 75,000 lines.</p> <p>The part I was interested in was the last three lines, or more particularly, the actual date joined. I figured that if I had all the dates joined, I could create the required histogram.</p> <p>The easiest way to do this was to use <tt>grep</tt> to find all the lines containing <tt>joined_at</tt> and then use the context option (<tt>-A 1</tt>) the show the line following. This gave:</p> <pre><code><br />$ grep -A 1 joined_at members.txt<br /><td class="joined_at"><br /> Nov 24, 2010<br />--<br /><td class="joined_at"><br /> Nov 24, 2010<br />--<br /><td class="joined_at"><br /> Nov 28, 2010<br />...<br /></code></pre> <p>To clean up the output, I used <tt>grep -v</tt> which gave:</p> <pre><code>$ grep -A 1 joined_at members.txt | grep -v joined | grep -v -- -- <br /> Nov 24, 2010<br /> Nov 24, 2010<br /> Nov 28, 2010<br />...<br /></code></pre> <p>I was not interested in the day on which people joined, only the month and year. Also, I wanted it in the format <tt>year month</tt>. This was easily accomplished using <tt>AWK</tt></p> <pre><code>awk '{print $3, $1}' </code></pre> <p>In other words, print the third and first field of the output.</p> <p>We now have:</p> <pre><code>... | awk '{print $3, $1}'<br />2010 Nov<br />2010 Nov<br />2010 Nov<br />2010 Nov<br />2009 Oct<br />...<br /></code></pre> <p>As you will notice, the data is not necessarily in sorted order. In order to sort numerically, we need the month number, not name, for the month. Converting 'Nov' to '10' is done easily using <tt>sed</tt>. I won''t type the full command, but it looks like this:</p> <pre><code>sed 's/Jan/01/;s/Feb/02/;s/Mar/03/;...'<br /></code></pre> <p>So, we now have:</p> <pre><code>2010 11<br />2010 11<br />2010 11<br />2009 10<br />...<br /></code></pre> <p>All that is left to do is to sort the output numerically (<tt>sort -n</tt>) and then count the number of unique occurrences of each line (<tt>uniq -c</tt>).</p> <p>Doing this gives us:</p> <pre><code> 9 2009 10<br /> 1 2010 10<br /> 64 2010 11<br /> 112 2010 12<br /> 403 2011 01<br /> 60 2011 02<br /> 55 2011 03<br /> 23 2011 04<br /> 33 2011 05<br /> 36 2011 06<br /> 18 2011 07<br /> 60 2011 08<br /> 31 2011 09<br /> 42 2011 10<br /> 22 2011 11<br /> 21 2011 12<br /> 22 2012 01<br /> 23 2012 02<br /> 10 2012 03<br /> 40 2012 04<br /></code></pre> <p>A histogram showing the number of people that joined each month since 2009. There is an interesting network effect in the above data. As soon as the site grew past a critical point, there was an explosion of new members (which took the number of members to just under half of all members of the organisation) and then the members signing up slowed and became almost constant.</p> <p>However, what I wanted to show in this post was how having a basic knowledge of Unix tools, it is possible to do some reasonably advanced analytics on what may initially seem to be quite unstructured data.</p>mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-5481721941419498732012-03-31T16:59:00.000-07:002012-03-31T17:07:00.169-07:00Let's tawk about AWK<style>pre {font-family: "Andale Mono", "Lucida Console", "Monaco", fixed, monospace; color: #000000; background-color: #eee; font-size: 12px; border: 1px dashed #999999; line-height: 14px; padding: 5px; overflow: auto; width: 100%} </style> <p>Recently I was at a Python conference and I listened to a speaker tell us that he used to use Fortran, C, Lisp, AWK and a few other languages, but now that he uses Python, he no longer needs to use those other languages. </p><p>This surprised me. Although I use Python whenever I can and really enjoy using it, I don't always think it is the best or most suitable language to use for every occasion. In particular, it was AWK which made me think "Why would someone stop using AWK and use Python instead". <p></p>I have been meaning to write a short AWK tutorial for months, and recently came across a coworker who I regard as a good programmer who had not heard of AWK which inspired me to start writing this tutorial. <p></p>So, in case you've never heard of AWK, what is it? It is a small language that has been around since 1977 and was part of Version 7 Unix. </p><p>It is very simple. The structure of an AWK program is generally a series of lines where each line of the series has the following format: </p><p><pre class="code"><br />expression { statement }<br />expression { statement }<br />...<br /></pre></p><p>AWK is like grep in that it looks at each line of a file and processes the file on a line by line basis. However, it will be much easier to demonstrate than explain. </p><p>Suppose we have a file with the following contents: </p><p><pre class="code"><br />...<br />9021 2000-12-27 0 0<br />9021 2000-12-28 0 0<br />9021 2000-12-29 .6 0<br />9021 2000-12-30 0 0<br />9021 2000-12-31 0 0<br />86071 2000-01-01 .4 0<br />86071 2000-01-02 0 0<br />86071 2000-01-03 0 0<br />86071 2000-01-04 3.4 0<br />...<br /></pre></p><p>The first column (affectionately known as $1 in AWK), is the identifier for a weather station. The second column ($2) is obviously a date, the third column is the amount of rain that fell in the previous 24 hours and the 4th column is a quality flag. </p><p>Suppose we wanted to print all records, but only for station 9021, we could do it with the following program: </p><p><pre class="code"><br />awk '$1 == 9021 {print $0}' rain_data.txt<br /></pre> This will give us: <pre><br />9021 2000-01-01 0 0<br />9021 2000-01-02 0 0<br />9021 2000-01-03 0 0<br />9021 2000-01-04 0 0<br />9021 2000-01-05 0 0<br />9021 2000-01-06 0 0<br />...<br /></pre></p><p>In the above example, the expression was '$1 == 9021' and the statement was '{print $0}'. $0 is AWK shorthand for "the whole line". </p><p>If we have an expression and we don't have a statement, then the default action is to print the whole line, so we can shorten the above to: </p><pre><br />awk '$1 == 9021' rain_data.txt<br /></pre><p>Also, we should note that the default field separator is whitespace (it can be one or more spaces or tabs). Now, suppose we wanted to display records for days that actually had some rainfall, we could write: <pre><br />awk '$3 > 0.0' rain_data.txt<br /></pre> This will give: <pre><br />9021 2000-01-12 .8 0<br />9021 2000-01-15 13 0<br />9021 2000-01-16 5 0<br />9021 2000-01-17 4.6 0<br />...<br /></pre>Now suppose we want days where the rainfall was above 10mm and the station was 86071, we could do it as follows: <pre><br />awk '$3 > 10.0 && $1 == 86071' rain_data.txt<br /></pre> <pre><br />9021 2000-01-12 .8 0<br />9021 2000-01-15 13 0<br />9021 2000-01-16 5 0<br />9021 2000-01-17 4.6 0<br />...<br /></pre></p><p>We have looked at the expressions, so now lets look at the statements. Often we want to print a subset of the fields. Lets print out only the first three fields (in other words omit the quality field). Since we are not printing the quality field, we will select only records where the quality is good (equal to zero). </p><p><pre><br />awk '$3 > 10.0 && $4 == 0 {print $1, $2, $3}' rain_data.txt<br /></pre> This gives: <pre><br />9021 2000-01-15 13<br />9021 2000-01-22 60.8<br />9021 2000-01-23 14.4<br />9021 2000-03-22 24<br />...<br /></pre> Now, we will print the rainfall in inches (mm / 25.4) <pre><br />awk '$3 > 10.0 && $4 == 0 {print $1, $2, $3/25.4}' rain_data.txt<br /></pre> Which gives <pre><br />9021 2000-01-15 0.511811<br />9021 2000-01-22 2.3937<br />9021 2000-01-23 0.566929<br />9021 2000-03-22 0.944882<br />...<br /></pre> </p><p>There are two special expressions in AWK, 'BEGIN' and 'END'. BEGIN matches before the first line of a file has been read and END matches after the last line of text has been read. These are often used for initialisation or printing totals. </p><p>As soon as we start doing anything more complex than a single line, it makes sense to create a small script, for example prog.awk, which we then run as follows: </p><p><pre><br />awk -f prog.awk datafile<br /></pre></p><p>In the first script, we want to find the maximum rainfall of all the values in our file. We create the following script (maxrain.awk) </p><p><pre><br /> { if ($3 > max) {max = $3} }<br />END { print max }<br /></pre></p><p>The first line does not have an expression, so it matches all records. The if statement says that if the value in the third column (rainfall), is greater than the current value in max, replace it. As mentioned above, 'END' will match after all lines have been processed and will print out the value of max. </p><p>We will look at one final aspect of AWK which is associative arrays. An associative array is an array where the index does not have to be an integer (although in this case it will be). We also don't have to declare the size of the array as we do in many other languages. </p><p>In this example, we will find the maximum for each of the stations. We do this by creating an array of maximum values where the station number is the index of the array. So, for example, we have station number 86071 and station number 9021. maxarray[86071] will contain the maximum value that we have found so far for that station and maxarray[9021] will contain the maximum value for station 9021. <pre><br /> { if ($3 > maxarray[$1]) {maxarray[$1] = $3} }<br />END { for (station in maxarray) print station, maxarray[station] }<br /></pre> Running this we get the following output: <pre><br />9021 60.8<br />86071 38.4<br />...<br /></pre></p><p>AWK is a small language and a very powerful language. It is not a general purpose language like Python with a rich library. But I sometimes use it many times a day, because those things that it does well, it does very well. It is also very simple and concise. I rarely need to look up any of the functions of AWK. It is concise enough that many times it can just be typed in from the command line. Any software engineer should have many tools in their toolbelt and choose the best tool for the job. It is surprising how often a little language like AWK will turn out to be that tool.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-76996204102255019562012-03-30T13:48:00.000-07:002012-03-30T13:48:05.530-07:00Quicksort in Python<p>I have been brushing up on my algorithms. One of the algorithms that I looked at was Quicksort. It is a sorting algorithm developed by Tony Hoare in 1960. If you are using the built in sort function in C, you are using quicksort. </p><p>Quicksort is one of those algorithms you study in university and then often forget. However, as I mention, I was brushing up on my algorithms and so decided to revisit quicksort. The way I usually understand algorithms is to first look at the algorithm and then look at an implementation. </p><p>Here is a quick description of the algorithm: </p> Quicksort: <ol><li> Take an array to be sorted. </li> <li> If the array has fewer than two values, we will consider it sorted (return).</li> <li> Choose a value which we will call the pivot value. This can be the first element of the array.</li> <li> Rearrange the array so that all values less than the pivot value are placed before the pivot value and all other values are placed after the pivot value.</li> <li> Run Quicksort over each of the two subarrays created in the previous step.</li> </ol> Here is an implemenation in C (from http://rosettacode.org/wiki/Quicksort): <pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><br /><br />void quick_sort (int *a, int n) {<br /> if (n < 2)<br /> return;<br /> int p = a[n / 2];<br /> int *l = a;<br /> int *r = a + n - 1;<br /> while (l <= r) {<br /> while (*l < p)<br /> l++;<br /> while (*r > p)<br /> r--;<br /> if (l <= r) {<br /> int t = *l;<br /> *l++ = *r;<br /> *r-- = t;<br /> }<br /> }<br /> quick_sort(a, r - a + 1);<br /> quick_sort(l, a + n - l);<br />}<br /> <br />int main () {<br /> int a[] = {4, 65, 2, -31, 0, 99, 2, 83, 782, 1};<br /> int n = sizeof a / sizeof a[0];<br /> quick_sort(a, n);<br /> return 0;<br />}<br /><br /></pre><p>While we can see what the above code is doing, and we can map the variable names to the description from the algorithm, it is not immediately clear how the algorithm works. </p><p>In fact, if we replaced the name quick_sort by q, and asked someone what the code was doing, it would be reasonably hard for most people to guess that it was a quicksort. </p><p>The next thing I did was to look for a Python implementation and found this one (http://en.literateprograms.org/Quicksort_(Python)?printable=yes): </p> <pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><br />def qsort1(list):<br /> """<br /> Quicksort using list comprehensions<br /> """<br /> if list == []: <br /> return []<br /> else:<br /> pivot = list[0]<br /> lesser = qsort1([x for x in list[1:] if x < pivot])<br /> greater = qsort1([x for x in list[1:] if x >= pivot])<br /> return lesser + [pivot] + greater<br /></pre><p>Immediately it is obvious what the algorithm is doing. Looking at the algorithm, though, we notice that the first line of the function: </p> <pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><br />if list == []<br /></pre><p>Looks for an empty list. This is because if we get an empty list, we can assume it is sorted. However, if we get a list with only one element, we can also assume it is sorted. I changed the lines to: </p> <pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><br />if len(list) <= 1: <br /> return list<br /></pre><p>I measured the speed of the two algorithms and the second gave me a 10% improvement in speed. Would I have noticed this so easily in the C code? I am not sure. However, having noticed this in the Python code, it seemed fairly obvious that it could be improved and I also felt confident that my improvement was going to work. </p><p>So, even though Python may not be as efficient as C or some compiled languages, the clarity of the code certainly allows us to make improvements and optimisations that may otherwise have been very difficult. </p>mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-8069090199143208202011-07-30T21:18:00.000-07:002012-04-30T17:23:03.905-07:00Secret Santa and Derangements<script type="text/x-mathjax-config"> MathJax.Hub.Config({tex2jax: {inlineMath: [['$$','$$'], ['\\(','\\)']]}}); </script><br /><script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script><br /><br /><br /><br />Here's an interesting question. Suppose you're holding a Secret Santa - you write down the names of your friends and yourself and put all the names in a hat. Each person pulls a name out of the hat and they must buy a present for the person whose name they pull out. What is the chance that someone will get their own name? <br /><br />I recently discovered that if we permute an ordered set in such a way that no element is in its original position, this is called a derangement. So, we can rephrase the question in the first paragraph as what is the chance of a derangement. <br /><br />If we have one name in our hat, then the chance of a derangement is 0. If we have two names, the chances are 50%. If we have 3 names, then there are 6 permutations. Of these, 4 of them are derangements, so we have a 2/3 chance of a derangement. As we have more names, things get a little complicated. We should be able to use mathematics to solve this, but since I am more adept with writing software than with mathematics, I thought I would write a simulation in Python to calculate the probability and then ask my friend and excellent mathematician, Alasdair, to do the mathematical analysis. So, without further ado, here is a Python program which simulates a Secret Santa for 10, 20, 30, 40, 50 and 100 people and runs<br />each simulation 10,000 times.<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">'''<br />Write a Monte Carlo simulation to estimate the chance<br />of a derangement.<br />'''<br /><br />import random<br /><br />def event(n):<br /> '''<br /> Given n people, create an array of n elements<br /> shuffle the array and then return True if it's<br /> a derangement, else False.<br /> '''<br /> a = [i for i in range(n)]<br /> random.shuffle(a)<br /><br /> for i in xrange(n):<br /> if i == a[i]:<br /> return False<br /> return True<br /><br />simulations = 10000<br />for people in [10, 20, 30, 40, 50, 100]:<br /> derangements = 0<br /> for i in range(simulations):<br /> if event(people):<br /> derangements += 1<br /><br />print 'People: %d Derangements: %f' % (<br /> people, float(derangements) / simulations)<br /><br /></pre><br />Running this program, we get (maybe, because we get different results each time we run it):<br /><pre>People: 10 Derangements: 0.371900<br />People: 20 Derangements: 0.370300<br />People: 30 Derangements: 0.370500<br />People: 40 Derangements: 0.368200<br />People: 50 Derangements: 0.360000<br />People: 100 Derangements: 0.372100<br /></pre><br />It is hard to tell from these results whether we are converging to a value. Ideally we should be doing about a million events for each simulation and using maybe a million people (we are trying to find out where the series converges when the amount of people approaches infinity). I am sure that there will be some very simple value for the final answer. So, over to Alasdair...<br /><br />To look at the mathematics behind derangements, lets call the number of derangements of n objects D(n). A bit of fiddling will show that D(2)=1, D(3)=2 and D(4)=9. For example, to show that D(4)=9, you can list all the permutations of 4 objects, and find which of those are derangements:<br /><br /><table><tbody><tr> <td>1</td> <td>2</td> <td>3</td> <td>4</td> <td><br /></td> </tr><tr> <td>1</td> <td>2</td> <td>4</td> <td>3</td> <td><br /></td> </tr><tr> <td>1</td> <td>3</td> <td>2</td> <td>4</td> <td><br /></td> </tr><tr> <td>1</td> <td>3</td> <td>4</td> <td>2</td> <td><br /></td> </tr><tr> <td>1</td> <td>4</td> <td>2</td> <td>3</td> <td><br /></td> </tr><tr> <td>1</td> <td>4</td> <td>3</td> <td>2</td> <td><br /></td> </tr><tr> <td>2</td> <td>1</td> <td>3</td> <td>4</td> <td><br /></td> </tr><tr> <td>2</td> <td>1</td> <td>4</td> <td>3</td> <td>*</td> </tr><tr> <td>2</td> <td>3</td> <td>1</td> <td>4</td> <td><br /></td> </tr><tr> <td>2</td> <td>3</td> <td>4</td> <td>1</td> <td>*</td> </tr><tr> <td>2</td> <td>4</td> <td>1</td> <td>3</td> <td>*</td> </tr><tr> <td>2</td> <td>4</td> <td>3</td> <td>1</td> <td><br /></td> </tr><tr> <td>3</td> <td>2</td> <td>1</td> <td>4</td> <td><br /></td> </tr><tr> <td>3</td> <td>2</td> <td>4</td> <td>1</td> <td><br /></td> </tr><tr> <td>3</td> <td>1</td> <td>4</td> <td>2</td> <td>*</td> </tr><tr> <td>3</td> <td>1</td> <td>2</td> <td>4</td> <td><br /></td> </tr><tr> <td>3</td> <td>4</td> <td>1</td> <td>2</td> <td>*</td> </tr><tr> <td>3</td> <td>4</td> <td>2</td> <td>1</td> <td>*</td> </tr><tr> <td>4</td> <td>1</td> <td>2</td> <td>3</td> <td>*</td> </tr><tr> <td>4</td> <td>1</td> <td>3</td> <td>2</td> <td><br /></td> </tr><tr> <td>4</td> <td>2</td> <td>1</td> <td>3</td> <td><br /></td> </tr><tr> <td>4</td> <td>2</td> <td>3</td> <td>1</td> <td><br /></td> </tr><tr> <td>4</td> <td>3</td> <td>1</td> <td>2</td> <td>*</td> </tr><tr> <td>4</td> <td>3</td> <td>2</td> <td>1</td> <td>*</td> </tr></tbody></table><br />If you like a hard time, then you can find that D(5)=44. Beyond that the amount of work becomes too big for this sort of brute force calculation.<br /><br />So then, without listing all permutations, how could we go about finding D(5)? Well, start with five objects<br /><p><br />1 2 3 4 5<br /><br /></p><br />Clearly we can't keep just one object in place and permute the other four. So we have to change the positions of at least two objects, and permute the other three. For example, we could swap 1 and 2:<br /><p><br />2 1 * * *<br /></p><br />and then there are two derangements of the other three. And in fact this gives as a method of counting all derangements. The number 1 can also swapped with three other numbers:<br /><table><tbody><tr> <td>3</td> <td>1</td> <td>*</td> <td>*</td> <td>*</td> </tr><tr> <td>4</td> <td>1</td> <td>*</td> <td>*</td> <td>*</td> </tr><tr> <td>5</td> <td>1</td> <td>*</td> <td>*</td> <td>*</td> </tr></tbody></table><br />and in each case the derangements of the rest can be counted. How might we count the number of derangements in the case where say, 4 is in the first place? Like this:<br /><p><br />4 * * * *<br /></p><br />We consider two cases:<br /><ol><li>The 4 and the 1 are swapped: 4 * * 1 *</li><li>The 1 goes into some other place, such as 4 * 1 * *</li></ol><br />Now to count the number of derangements in each case.<br /><ol><li>The other numbers 2, 3, and 5, have not been moved. So in this case the total number of derangements is the number of derangements of three objects: D(3)=2.</li><li>Suppose we try to count the number of derangements now where the 1 is <i>not</i> in the fourth place. This means we have:<p><br /><table><tbody><tr> <td>Original:</td> <td>1</td> <td>2</td> <td>3</td> <td>4</td> <td>5</td> </tr><tr> <td>After the swap:</td> <td>4</td> <td>2</td> <td>3</td> <td>1</td> <td>5</td> </tr></tbody></table><br /></p><br />Forgetting about the initial four, the number of derangements is the number of derangements of the other four: this will ensure that 1 does not end up in the fourth place. And this number is D(4)=9.</li></ol><br />What we have shown now is that the number of derangements with 4 in the first place is D(3)+D(4)=2+9=11. But there are four possible values for the first place (every number other than 1), so the total number of derangements is 4 x (D(3)+D(4)) = 4 x 11 =44. And this is the value of D(5). Applying this argument to the general case leads to the recursive formula<br /><p><br /><br />D(n)=(n-1)(D(n-1)+D(n-2))<br /></p><br />The next few values are easily determined: D(6)=5(D(4)+D(5))=5(9+44)=265. Then D(7)=6(D(5)+D(6))=6(44+265)=1854. And so on.<p><br /><br />The question now is to determine a non-recursive formula. And for this we are going to use the inclusion-exclusion principle, which you may have seen in relation to the sizes of set intersections and unions. For two sets, it says that: $$|A\cup B|=|A|+|B|-|A\cap B|$$ That is, the number of elements in a set union is the sum of the set cardinalities, minus the cardinality of the set intersection. The reason for subtracting the set intersection is that when the cardinalities of the two sets are added, each element in the union is counted twice.<br /><br />For three sets, we have $$|A\cup B\cup C|=|A|+B|+|C|-|A\cap B|-|A\cap C|-|B\cap C|+|A\cap B\cap C|$$ So first the (cardinalities of the) sets are added; then the intersections of all pairs of sets subtracted, and finally the intersection of all sets added. And this principle can be extended to an arbitrary number of sets. First add the cardinalities of all the sets, then subtract the cardinalities of all intersections of pairs of sets, then add the cardinalities of all intersections of triples of sets, then subtract the cardinalities of all intersections of quadruples of sets, then add... you get the idea!<br /><br />What does this have to do with derangements? Well, consider all the derangements of four objects, and let $$A$$ be the set in which 1 is in its correct place, $$B$$ the set for which 2 is in the correct place, and $$C$$ and $$D$$ set sets with 3 and 4 in the correct place respectively. We can now use the inclusion-exclusion principle to determine the number of permutation where at least one element is in the correct place: that is, the size of the union of $$A$$, $$B$$, $$C$$ and $$D$$. Now the size of each individual set is $$3!$$. The size of each intersection of two sets is $$2!$$ (two elements are fixed, leaving two to move), and there are<br /><br />$${4\choose 2}=6$$<br />possible pairs. The size of each intersection of three sets is $$1!$$ (three elements are fixed, leaving only one to move), and there are<br /><br />$${4\choose 3}=4$$<br />possible triples. Thus we have<br /><br />$$A\cup B\cup C\cup D = 4(3!)-{4\choose 2}2!+{4\choose 3}1!+{4\choose 4}0!$$.<br />Writing each binomial coefficient in terms f factorials and simplifying, this sum becomes<br /><br />$$4!-\frac{4!}{2!}+\frac{4!}{3!}-\frac{4!}{4!}$$<br />or as<br /><br />$$4!(\frac{1}{1!}-\frac{1}{2!}+\frac{1}{3!}-\frac{1}{4!})$$.<br />But this value (call it $$v$$) is the number of permutations in which at least one element is fixed. The number of derangements is $$4!-v$$, or<br /><br />$$4!(1-\frac{1}{1!}+\frac{1}{2!}-\frac{1}{3!}+\frac{1}{4!})$$.<br />In general, then:<br /><br />$$D(n)=n!(1-\frac{1}{1!}+\frac{1}{2!}-\frac{1}{3!}+\frac{1}{4!}-\cdots +(-1)^n\frac{1}{n!}).$$<br />If we divide both sides by $$n!$$, and take the limit as n rockets off to infinity, we have<br /><br />$$\lim_{n\to\infty}\frac{D(n)}{n!}=1-\frac{1}{1!}+\frac{1}{2!}-\frac{1}{3!}+\frac{1}{4!}-\cdots$$<br />and the right hand side can be seen to be $$e^{-1}\approx 0.36787944$$, which is what was obtained above by experiment.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-40848890249119453292011-06-04T22:03:00.001-07:002011-06-04T22:07:02.571-07:00Sorting without sortingIt is Sunday, the children have gone to visit their grandparents and I should be writing an essay this afternoon for my Graduate Certificate in management.<br /><br />Instead I was browsing the web and came across a <a href="http://codekata.pragprog.com/2007/01/kata_eleven_sor.html">Kata</a> by Dave Thomas from Pragmatic<br />Programmers.<br /><br />The idea of Katas is that they are exercises that allow programmers to keep<br />their skills in tune by continually solving small programming (or sometimes<br />other) problems. In this particular kata, Dave offers the following problem:<br /><br /><i>Our resident conspiracy expert, Dr. X, is looking for hidden messages in the<br />collected publications of Hugh Hefner. Dr. X believes the message is hidden in<br />the individual letters, so rather than get distracted by the words, he’s asked<br />us to write a program to take a block of text and return the letters it<br />contains, sorted. Given the text:<br /><br />When not studying nuclear physics, Bambi likes to play<br />beach volleyball.</i><br /><br />our program would return:<br /><br /><tt>aaaaabbbbcccdeeeeeghhhiiiiklllllllmnnnnooopprsssstttuuvwyyyy</tt><br /><br /><i><br />The program ignores punctuation, and maps upper case to lower case.<br /><br />Are there any ways to perform this sort cheaply, and without using built-in libraries? <br /></i><br /><br />I decided to try a solution quickly in Python and came up with the following<br />solution:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">sentence = '''<br /> When not studying nuclear physics, Bambi likes to play<br /> beach volleyball."<br />'''<br /><br />frequency = [0] * 26<br /><br />def increment_frequency(letter):<br /> frequency[ord(letter) - ord('a')] += 1<br /><br />[increment_frequency(letter) for letter in sentence.lower() if letter >= 'a' <br /> and letter <= 'z']<br /><br />print ''.join([chr(i + ord('a')) * frequency[i] for i in<br /> range(len(frequency))])<br /><br /></pre><br />What I have done is to create an array of 26 elements, one for each letter of the alphabet, and each time I see a letter, I increment the count for that letter. In other words, we don't sort the letters, we just keep track of the frequency of each letter. Finally, we print out the contents of the array as a histogram, using the letter to represent the bar.<br /><br />I did not come up with the solution - I am sure that I have seen it before. Probably more than once. But I think doing the katas is a good way to keep that sort of thing in front of the mind. Also, while doing it, I made the mistake that I often make of using iter.join(str), rather than using str.join(iter).<br /><br />Okay, now on to my essay...mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-59075364137361796232011-04-11T03:33:00.000-07:002011-04-11T03:33:50.451-07:00Miller-Rabin to the rescueI love words and I love primes, so when I came across this <a href="http://primes.utm.edu/notes/words.html">article</a> on prime words, naturally I was tempted to try to implement it in Python.<br /><br />In case you can't click through to the article, here is what we want to do. In base 36 every letter of the alphabet represents a digit. So every word is a number. The digit A has the value 11, B has the value 12 and D has the value 14, so BAD would represent 12*(36^2) + 11*(36^1) + 14*(36^0).<br /><br />This value may or may not be prime. Our program will read all the words in a dictionary and print out the words which have a prime value.<br /><br />Here is my first (naive) attempt:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">#<br /># Read a dictionary and print out all words which are prime base 36.<br />#<br /><br />import math<br /><br />def isprime(num):<br /> '''<br /> Return true if a number is prime, else false.<br /> '''<br /><br /> # most of the numbers will be tested by the next four lines<br /> if num == 2 or num == 3 or num == 5:<br /> return True<br /> if num % 2 == 0 or num % 3 == 0 or num % 5 == 0:<br /> return False<br /> <br /> # For the remainder...<br /> n = 5<br /> while (n <= math.sqrt(num)):<br /> if num % n == 0:<br /> return False<br /> n += 2<br /> return True<br /><br /><br />for word in open('/usr/share/dict/words'):<br /> if isprime(int(word.rstrip().upper(), 36)):<br /> print(word)<br /><br /></pre> So, what is wrong with it? Well, I started running it and I started to see a few words, for example: <pre>Ababdeh<br />abacist<br />abandon<br /></pre>But then things slowed down. As soon as I thought about it for a few seconds, I realised that some words probably have quite high integer values. For example, let's take <tt>ankyloblepharon</tt> (the condition where your eyelids are stuck together). Let's convert this to an integer base 64: <pre>>>> int('ankyloblepharon', 36)<br />65432123906951596638119L<br /></pre>Okay, that's a pretty big number. Fortunately we only test half the numbers in the range. That's still pretty big. Even more fortunate, we only test up the the square root of the number which gives us: <pre>>>> math.sqrt(int('ankyloblepharon', 36) /2)<br />180875819150.80798<br /></pre>Adding a few commas, we get: 180,875,819,150 or to put it more simply, 180 billion tests. Clearly we need something more efficient. Luckily there is a test for primality called Miller-Rabin which is a "probabilistic" test. We test a number to see if it's prime. If the test tells us the number is composite, we are sure it is composite. If the test tells us the number is prime, we are pretty sure it is prime. Depending on the parameter we use, pretty sure can mean something like 99.999999999999% sure. I used the <a href="http://en.wikipedia.org/wiki/Miller-Rabin">Wikipedia definition of Miller-Rabin</a>. Here is the revised code: <pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">#<br /># Read a dictionary and print out all words which are prime base 36.<br />#<br /><br />import math<br /><br /><br />#<br /># Implementation of Miller-Rabin based on Wikipedia algorithm<br />#<br /># Algorithm:<br /># Input: n > 3, an odd integer to be tested for primality;<br /># Input: k, a parameter that determines the accuracy of the test<br /># Output: composite if n is composite, otherwise probably prime<br /># write n - 1 as 2^s*d with d odd by factoring powers of 2 from n - 1<br /># LOOP: repeat k times:<br /># pick a random integer a in the range [2, n - 2]<br /># x <- a^d mod n<br /># if x = 1 or x = n - 1 then do next LOOP<br /># for r = 1 .. s - 1<br /># x <- x2 mod n<br /># if x = 1 then return composite<br /># if x = n - 1 then do next LOOP<br /># return composite<br />#return probably prime<br /># <br />#<br />import random<br /><br />def inner_test(n, s, x):<br /> for r in range(1, s):<br /> x = pow(x, 2, n)<br /> if x == 1:<br /> return False<br /> if x == n - 1:<br /> return True <br /> return False <br /><br /><br />def miller_rabin(n):<br /> if n < 2:<br /> return False<br /><br /> if n % 2 == 0:<br /> return n == 2<br /><br /> k = 30 # Number of tests<br /> s = 0 <br /> d = n - 1<br /> while d % 2 == 0:<br /> s += 1<br /> d = d / 2<br /> for i in range(k):<br /> a = random.randint(2, n - 1)<br /> x = pow(a, d, n) <br /> <br /> if x == 1 or x == n - 1:<br /> continue<br /> <br /> if not inner_test(n, s, x):<br /> return False<br /> return True<br /><br /><br /><br />for word in open('/usr/share/dict/words'):<br /> if miller_rabin(int(word.rstrip().upper(), 36)):<br /> print(word)<br /><br /></pre><br />So, how much better was the revised code? Well, in less time than it took to print the first four words using the first test, the revised test found 12,000+ words.<br /><br />Interestingly, there is also a deterministic version of Miller-Rabin. According to the Wikipedia article, we can test numbers up to 341,550,071,728,321 using only the following numbers: 2, 3, 5, 7, 11, 13 and 17. I have a feeling that limit should cover most of the words in our dictionary, but that is a task for another day. I think a speed up of about 4,000 times is sufficient for one day.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-4672889702425416792011-03-01T18:00:00.000-08:002011-03-01T18:00:25.226-08:00When programming languages are being discussed on forums, one of the topics that is almost always an issue is the kind of community in which the language exists (and I am using the idea of the language existing within a community quite deliberately). I have found that many languages have quite a vibrant community where people are willing to share thoughts and ideas and are supportive of those who are learning the language. A few weeks ago, though, I was lucky to see first hand the excellent spirit of the Python community. David Beazley, Python expert, author of SWIG, the Python Essential Reference and all round nice guy was travelling to Canberra, Australia to do some training. He offered, at no expense, to visit Melbourne and give us a sneak preview of the workshop that he will be delivering at PyCon2011. The workshop was pretty well attended for a workshop on a Saturday that was held at very short notice. The workshop was titled 'Mastering Python 3 I/O' and my first response was how much can we talk about I/O in Python 3 that I don't already know. After all, we all know that 'print x' has been replaced by 'print(x)'. Well, as it turns out, there was a lot that I don't know. Of the almost 150 slides that we covered, I found myself taking a note or two almost every second or third slide. The talk was broken down into 8 easily digestible parts. It started with simple issues like porting from 2 to 3 (and why 2to3 cannot cope with many I/O issues), progressed to a very nice overview on Unicode (an understanding of which is important in understanding Python 3) and continued through to system interfaces and library design issues. Here are just a few of my dotpoints from the talk (I am tempted to include more, but do not want to preempt Dave's talk at PyCon):<br /><br /> * All strings are Unicode strings in Python 3.<br /> * Most of the Unicode issues are handled transparently.<br /> * 2to3 will not address many I/O issues in migration.<br /> * Python 3 reimplements most of the I/O stack.<br /> * Python 3 has introduced a new formatting model.<br /> * Much of this model has been backported to Python (in particular, 2.7).<br /> * There are potentially lots of gotchas in Python 3. Forewarned is forearmed. <br /><br />I think it is a bit unfortunate that Python 3 is mentioned in the title. This talk is relevant to all Python programmers - even if you don't intend using Python 3 for a while. If you're attending PyCon and have a chance to attend Dave's workshop, I can recommend it. If you don't get a chance, hopefully the slides will be available. Finally, thanks Dave. You're a credit to the Python community and all of us in Melbourne enjoyed your breezy Chicago style. <br /><br />Note: After the conference, the slides should be available here: http://www.dabeaz.com/talks.htmlmudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-31966772576180236612011-02-06T02:21:00.000-08:002011-02-19T12:57:41.745-08:004x4s in PythonI recently came across a <a href="http://www.newcave.com/game/?id=3456">nice little game</a>. As far as I can tell, you are required to make as many numbers as possible using only four 4s. So, for example:<br />4 * 4 / 4 * 4 = 1<br />(4 / 4) + (4 / 4) = 2<br />(4 + 4 + 4) / 4 = 3<br />etc.<br /><br />I played the game until I got up to about 5 then thought it would be much more rewarding to write a little Python program to see how many numbers under 100 I could generate using only four 4s.<br /><br />There may be a better way to do this, but my simple approach was to generate all the permutations of four 4s together with the permissible set of operators (ie. +, -, x, / and ^). This gives us 7 characters. However, we also need to consider parentheses since (4 + 4 + 4) / 4 gives a very different result to 4 + 4 + (4 / 4). As soon as we include parentheses, we start getting into 'intractable land'. In other words, the number of permutations that we can generate becomes impractical. As anyone who has ever owned an HP calculator (the <a href="http://en.wikipedia.org/wiki/HP-10C_series">real ones</a> manufactured in the 1980s) knows, any parenthesised expression can be expressed in <a href="http://en.wikipedia.org/wiki/Reverse_Polish_Notation">Reverse Polish Notation</a> without parentheses. Very briefly, this means that we use a stack on push operands onto the stack. As soon as we encounter an operator (we will assume we only have binary operators), we pop two values from the stack, apply the operator and push the result back onto the stack. We can now express the previous two examples using RPN (Reverse Polish Notation).<br /><br /><tt>(4 + 4 + 4) / 4</tt> becomes <tt>444++4/</tt><br /><br />and<br /><br /><tt>4 + 4 + (4 / 4)</tt> becomes <tt>4444/++</tt><br /><br />Fortunately, writing a function to evaluate an RPN expression is straightforward. Here is one implementation.<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">def eval_rpn(exp):<br /> stack = []<br /> for ch in exp:<br /> if isdigit(ch):<br /> stack.append(int(ch))<br /> if isoperator(ch):<br /> if len(stack) < 2:<br /> return None<br /> rhs = stack.pop()<br /> lhs = stack.pop()<br /> if ch == '+':<br /> stack.append(lhs + rhs)<br /> elif ch == '-':<br /> stack.append(lhs - rhs)<br /> elif ch == '*':<br /> stack.append(lhs * rhs)<br /> elif ch == '/':<br /> if rhs == 0:<br /> return None<br /> stack.append(lhs / rhs)<br /> elif ch == '^':<br /> stack.append(lhs ** rhs)<br /> result = stack.pop()<br /> return result<br /></pre>Once we have this, the rest is easy. We just generate all permutations of the four digits and the allowable operators. We end up with the following program: <pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">'''<br />A program to evaluate each combination of 4 digits<br />'''<br /><br />import itertools<br /><br />operators = ['+','-','*','/','^']<br />digits = ['4','4','4','4']<br /><br /><br />pattern = operators + digits<br /><br />results = {}<br /><br />def isdigit(ch):<br /> return ch in '0123456789'<br /><br /><br />def isoperator(ch):<br /> return ch in '+-*/^'<br /><br />def eval_rpn(exp):<br /> stack = []<br /> for ch in exp:<br /> if isdigit(ch):<br /> stack.append(int(ch))<br /> if isoperator(ch):<br /> if len(stack) < 2:<br /> return None<br /> rhs = stack.pop()<br /> lhs = stack.pop()<br /> if ch == '+':<br /> stack.append(lhs + rhs)<br /> elif ch == '-':<br /> stack.append(lhs - rhs)<br /> elif ch == '*':<br /> stack.append(lhs * rhs)<br /> elif ch == '/':<br /> if rhs == 0:<br /> return None<br /> stack.append(lhs / rhs)<br /> elif ch == '^':<br /> stack.append(lhs ** rhs)<br /> result = stack.pop()<br /> return result<br /><br /><br /># Try every permutation of 7 characters...<br />for exp in itertools.permutations(pattern, 7):<br /> e = ''.join(exp)<br /> val = eval_rpn(e)<br /> if val is not None and int(val) == val and val >= 0:<br /> results[val] = ''.join(e) <br /><br />for i in range (1, 100):<br /> if i in results:<br /> print results[i], ' = ', i<br /></pre>When we run the program, we get the following output: <pre>4444*/^ = 1<br />444+4/- = 2<br />444/4^- = 3<br />4444^/- = 4<br />4444-^+ = 5<br />4444/-+ = 7<br />4444/^+ = 8<br />4444/-* = 12<br />44*44/- = 15<br />4444/^* = 16<br />44/44*+ = 17<br />4444/+* = 20<br />444+*4- = 28<br />44^44+/ = 32<br />44^4/4- = 60<br />44^4-4/ = 63<br />4444/-^ = 64<br />444^+4/ = 65<br />444^4/+ = 68<br />444/-4^ = 81<br /></pre><br /><br />After doing this, I noticed that we had more gaps than I expected. I reread the rules of the game and realised that an expression like 44 + 44 is also valid. However, I don't think it's a hard feature to add (or is it?). If I was going to spend a little more time on this program, it would be to present the results in infix notation. Maybe for the moment I will leave that as an exercise for the reader.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com1tag:blogger.com,1999:blog-8342851416423405349.post-62209905622407543162011-01-26T13:55:00.000-08:002011-02-19T13:00:57.319-08:00Handling dates with regular expressionsI have a confession to make. Part of the reason I have created this blog is as a memo to myself for little features that I have come across while using Python. I find myself being faced again and again with the same sort of problem and then I cannot remember in which script I last used it, so I go looking through all my scripts. There is another little confession; I used to lecture and enjoyed it. I am hoping that occasionally someone interested in learning Python will stumble across this blog and find something useful in it. Maybe even hoping that someone not interested in Python will stumble across this blog and become interested in Python.<br /><br />Anyway, this morning I came across a request from a colleague who wanted to be able to ingest dates - but dates come in different formats, eg. <tt>15/04/1707</tt> and <tt>1777-04-30</tt> and even <tt>16430104</tt>.<br /><br />One way to treat these dates would be to split them using the split function and then try to work out which part of the date is the year, the month and the day. Another method is to use regular expressions - and that is the topic of this post.<br /><br />I had some simple code which I developed a while back to attack exactly this type of problem. This is the code:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">import re<br />date_format_yyyy_mm_dd = re.compile('^\d\d\d\d-\d\d-\d\d$')<br />date_format_dd_mm_yyyy = re.compile('^\d\d/\d\d/\d\d\d\d$')<br />date_format_yyyymmdd = re.compile('^\d{8}$')<br /><br />stn_num_format = re.compile('^\d+$')<br /><br />'''<br />Validate a date field<br />'''<br />def valid_date(date):<br /> return ((date_format_dd_mm_yyyy.match(date) is not None)<br /> or (date_format_yyyy_mm_dd.match(date) is not None)<br /> or (date_format_yyyymmdd) is not None)<br /></pre><br /><br />The function <tt>valid_date</tt> will return true if a date uses one of the<br />formats above. We could, however, format our regular expressions more<br />elegantly as follows:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">date_format_yyyy_mm_dd = re.compile('^\d{4}-\d{2}-\d{2}$')<br />date_format_dd_mm_yyyy = re.compile('^\d{2}/\d{2}/\d{4}$')<br /></pre><br /><br />In the code above, once we know we have a valid date, we still have to parse<br />the date. In my original code, I used did the following: <br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">def normalise_date(date):<br /> '''<br /> Normalise a date - ie. given a variety of date formats<br /> return a date.<br /> '''<br /> if date_format_dd_mm_yyyy.match(date) is not None:<br /> return datetime.date(int(date[6:10]), int(date[3:5]), int(date[0:2]))<br /><br /> if date_format_yyyy_mm_dd.match(date) is not None:<br /> return datetime.date(int(date[0:4]), int(date[5:7]), int(date[8:10])) <br /><br /> if date_format_yyyymmdd.match(date) is not None:<br /> return datetime(int(date[0:4]), int(date[4:6]), int(date[6:8]))<br /><br /> datafile.write('ERROR in normalise date: ' + date)<br /></pre><br /><br />The code works - but is not particularly elegant. In particular, if we add a<br />new date format, we also have to add extra code to <pre>normalise_date()</pre><br /><br />Fortunately regular expressions allow us to create groups and<br />automatically refer to them. We do this as follows:<br /><br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">date_format_yyyy_mm_dd = re.compile('^(\d{4})-(\d{2})-(\d{2})$')<br />date_format_dd_mm_yyyy = re.compile('^(\d{2})/(\d{2})/(\d{4})$')<br /></pre><br /><br />Now, when we do a match, we can refer to each group. So, for example, the<br />following code:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">import re<br />exp = re.compile('^(\d{4})-(\d{2})-(\d{2})$')<br />date = '2009-12-31'<br />m = exp.match(date)<br />print m.group(1) # Year<br /></pre><br /><br />The above code prints 2009. We can, however, make the code clearer. We can<br />name each group and then refer to the group by name. This brings us to our<br />final example.<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">import re<br /><br />date_format_yyyy_mm_dd = re.compile(<br />'^(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$')<br />date_format_dd_mm_yyyy = re.compile(<br />'^(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})$')<br />date_format_yyyymmdd = re.compile(<br />'^(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})$')<br /><br />birthdays = ['15/04/1707', '1777-04-30', '16430104']<br /><br />date_formats = [date_format_yyyy_mm_dd, date_format_dd_mm_yyyy,<br /> date_format_yyyymmdd]<br /><br /><br />for d in birthdays:<br /> for f in date_formats:<br /> m = f.match(d)<br /> if m:<br /> print 'Date: %s/%s/%s' % (m.group('day'), m.group('month'),<br /> m.group('year'))<br /><br /><br /></pre><br /><br />This will give us the following output:<br /><br /><pre>Date: 15/04/1707<br />Date: 30/04/1777<br />Date: 04/01/1643<br /></pre><br /><br />Okay, exactly what we expected. Now, how do we handle the fact that in the US<br />dates are written MM/DD/YYYY? Don't even dare to go there!<br /><br />ps. I know at least one reader of this blog for whom the dates will be<br />meaningful.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-87320164360392054772011-01-25T21:39:00.000-08:002011-02-19T13:05:23.164-08:00A mathematical curiosityToday I came across <a href="http://www.futilitycloset.com/2011/01/16/a-surprise-visitor/"> this </a> wonderful little curiosity.<br /><br />I was going to write a little Python program to try it out. Basically, we are creating two fibonacci sequences, one starting with 12, 18 and the other starting with 5, 5. At first I thought there will be nothing new here. Then I remembered that I was looking for a good example with which to demonstrate generator functions and I figured out that this would be a perfect example. Basically, we are looking for a function that each time we call it will generate the next fibonacci number in a sequence.<br /><br />There is an example of such a Fibonacci function in an earlier post which I will repeat here:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">def fibonacci():<br /> parent, baby = 1, 0<br /><br /> while baby < 1000:<br /> print baby<br /> parent, baby = (parent + baby, parent)<br /></pre>In this case, we don't want to print the fibonacci number, but want to return the next number in the series. We can do this with a generator function. A generator function is one that has the keyword <tt>yield</tt>. <tt>yield</tt> is like a return statement, except that the function remembers where we are and then the next time we call that function, we start from where we left off (or, in other words, directly after <tt>yield</tt>). Here is our fibonacci generator: <pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">def fibonacci(t1, t2):<br /> while True:<br /> yield t1<br /> t1, t2 = t2, t1 + t2<br /><br />fib = fibonacci(0, 1)<br />for i in range(10):<br /> print fib.next()<br /><br /></pre>Running this we get: <pre>0<br />1<br />1<br />2<br />3<br />5<br />8<br />13<br />21<br />34<br /></pre>Now comes the interesting part - we can create as many generators as we want, so here is our final program: <pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">def fibonacci(t1, t2):<br /> while True:<br /> yield t1<br /> t1, t2 = t2, t1 + t2<br /><br />lhs = fibonacci(12, 18)<br />rhs = fibonacci(5, 5)<br />for i in range(30):<br /> l = lhs.next()<br /> r = rhs.next()<br /> print '%10d %10d %8.7f' % (l, r, float(l) / r)<br /></pre>So, when I run it, I get the following results: <pre>12 5 2.4000000<br />18 5 3.6000000<br />30 10 3.0000000<br />48 15 3.2000000<br />78 25 3.1200000<br />126 40 3.1500000<br />204 65 3.1384615<br />330 105 3.1428571<br />534 170 3.1411765<br />864 275 3.1418182<br /></pre>This looks promising and it seems to converge really quickly. So, lets run it for 20 iterations. <pre>12 5 2.4000000<br />18 5 3.6000000<br />30 10 3.0000000<br />48 15 3.2000000<br />78 25 3.1200000<br />126 40 3.1500000<br />204 65 3.1384615<br />330 105 3.1428571<br />534 170 3.1411765<br />864 275 3.1418182<br />1398 445 3.1415730<br />2262 720 3.1416667<br />3660 1165 3.1416309<br />5922 1885 3.1416446<br />9582 3050 3.1416393<br />15504 4935 3.1416413<br />25086 7985 3.1416406<br />40590 12920 3.1416409<br />65676 20905 3.1416408<br />106266 33825 3.1416408<br /></pre>Well, that's weird. It got really close to Pi (which, as far as I can remember is something like 3.1415927), but then it seems to wander off again. Well, maybe it oscillates for a while before settling down. So, let's try 30 iterations (I will just record the last few lines of output). <pre>728358 231840 3.1416408<br />1178508 375125 3.1416408<br />1906866 606965 3.1416408<br />3085374 982090 3.1416408<br />4992240 1589055 3.1416408<br />8077614 2571145 3.1416408<br />13069854 4160200 3.1416408<br /></pre><br /><br />Well, it seems that we're not converging on Pi, but something very close to Pi. <br /><br />Maybe I should have just stopped after 10 iterations. I marvelled at how the ratio between two fibonacci sequences would converge on Pi. Now I am a bit wiser, but also a bit sadder.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com2tag:blogger.com,1999:blog-8342851416423405349.post-30930578529315409152011-01-09T22:15:00.000-08:002011-02-19T13:08:35.755-08:00My wife recently was doing some programming for work. The work involved creating a database that contained the names of bulls. The bulls often have interesting names like "KIRK ANDREWS FORCEFUL" or "AULDREEKIE ADDISON VACUM".<br /><br />It also happens that when people are searching for a bull, they may type in the wrong name, for example "AULDREKIE ADDISON VACUUM" or type "JISSELVLIEDT 135 BREAKOUT" instead of "IJSSELVLIEDT 135 BREAKOUT".<br /><br />So, given a (mistyped) bull name, how can we find out what we think the user probably meant? One way to do this is by using a metric known as Levenshtein distance (or edit distance). The edit distance between two strings is defined as the minimum number of inserts, deletes or substitutions of a single character required to transform one string into another. For example the following transforms all have a Levenshtein distance of 1:<br /><br /><pre>BOY -> TOY (1 substitution)<br />CHAT -> HAT (1 deletion)<br />HAT -> CHAT (1 insertion)<br /></pre><br /><br />While the following have a Levenshtein distance of 2:<br /><br /><pre>PAPER -> TAPE (1 deletion, 1 substituion)<br />TAPE -> TRADE (1 insertion, 1 substitution)<br /></pre><br /><br />Wikipedia has a very good <a href="http://en.wikipedia.org/wiki/Levenshtein_distance"> description</a> of the algorithm.<br /><br />They also have the pseudocode which I have implemented in Python. Here it is:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">'''<br />Write a function that calculates the Levenshtein distance between<br />two words.<br />'''<br /><br />def levenshtein(first, second):<br /> m = len(first) + 1<br /> n = len(second) + 1<br /><br /> # Declare an array...<br /> d = [[0] * n for j in range(m)]<br /><br /> # Initialise the array...<br /> for i in range(m):<br /> d[i][0] = i<br /><br /> for j in range(n):<br /> d[0][j] = j<br /><br /> for i in xrange(1, m):<br /> for j in range(1, n):<br /> if first[i-1] == second[j-1]: # no operation required...<br /> d[i][j] = d[i - 1][j - 1]<br /> else:<br /> d[i][j] = min( d[i - 1][j], d[i][j - 1], <br /> d[i - 1][j - 1]) + 1<br /> return d[m-1][n-1]<br /><br /><br />def main():<br /> list1 = 'great grate'.split()<br /> list2 = 'grate rate ate'.split()<br /><br /> for word1 in list1:<br /> for word2 in list2:<br /> print 'Levenshtein distance between %s and %s is %d' % (<br /> word1, word2, levenshtein(word1, word2))<br /><br /><br />if __name__ == '__main__':<br />main()<br /><br /><br /></pre><br /><br />Once again, the Python code looks very similar to the pseudocode.<br /><br />While there are many other types of algorithms for spelling correction or data scrubbing (e.g. Soundex and Metaphone), Levenshtein distance is often useful because it does not depend on the language or the phonics of the words. It is interested only in symbols, insertions, deletions and substitutions.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-16813988159327474922011-01-08T16:23:00.000-08:002011-02-19T13:11:50.390-08:00Python in LiouvilleThis morning I came across the <a href="http://www.futilitycloset.com/2011/01/08/hocus-pocus/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+FutilityCloset+%28Futility+Closet%29">following</a> little piece of mathematical trivia:<br /><br />Choose any number (e.g. 14) and write down its divisors:<br /><br /><pre>14<br />1, 2, 7, 14<br /></pre><br />Then write down the number of divisors of each of these divisors:<br /><pre>14<br />1, 2, 7, 14<br />1, 2, 2, 4<br /></pre><br />Now the square of the sum of this last group will always equal the sum of its member's cubes:<br /><pre>(1 + 2 + 2 + 4)<sup>2</sup> = 1<sup>3</sup> + 2<sup>3</sup> + 2<sup>3</sup> + 4<sup>3</sup><br /></pre><br />Discovered by Joseph Liouville.<br /><br />Well, I learned two new things today. Some mathematical trivia and that there was a French Mathematician that I had never heard of called <a href="http://en.wikipedia.org/wiki/Joseph_Liouville">Liouville</a>.<br /><br />Since I am always on the lookout for simple problems that can work as Python programming exercises, I decided to use the above problem.<br /><br />Here is my first attempt:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px; border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">def factorise(n):<br /> '''<br /> Given a number return a list of the factors of that number.<br /> '''<br /> factors = [1]<br /> i = 2<br /> while i <= n:<br /> if n % i == 0:<br /> factors.append(i)<br /> i += 1<br /> return factors<br /><br /><br />def try_num(n):<br /> factors = factorise(n)<br /> num_factors = []<br /> for factor in factors:<br /> num_factors.append(len(factorise(factor)))<br /><br /> print 'Factors: ', num_factors<br /> print 'Square of sum: ', sum(num_factors) * sum(num_factors)<br /> print 'Sum of cubes: ', sum([factor * factor * factor for factor in num_factors])<br /><br /><br />def main():<br /> try_num(14)<br /> try_num(144)<br /> try_num(65536)<br /><br />if __name__ == '__main__':<br /> main()<br /></pre> It works, but while making some small changes, I also noticed that we can do the factorising as a single list comprehension. In other words, we can replace <pre>factors = [1]<br />i = 2<br />while i <= n:<br /> if n % i == 0:<br /> factors.append(i)<br /> i += 1<br /> return factors<br /></pre> with <pre>return [x + 1 for x in xrange(n) if n % (x + 1) == 0]<br /></pre> Also, we can use a list comprehension for the loop in try_num. This led to the second version of the program: <pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><br />def factorise(n):<br /> '''<br /> Given a number return a list of the factors of that number.<br /> '''<br /> return [x + 1 for x in xrange(n) if n % (x + 1) == 0]<br /><br />def try_num(n):<br /> num_factors = [len(factorise(factor)) for factor in factorise(n)]<br /><br /> print 'Factors: ', num_factors<br /> print 'Square of sum: ', sum(num_factors) ** 2<br /> print 'Sum of cubes: ', sum([factor ** 3 for factor in num_factors])<br /><br />def main():<br /> try_num(14)<br /> try_num(144)<br /> try_num(2011)<br /> try_num(65536)<br /><br />if __name__ == '__main__':<br /> main()<br /></pre><br /><br />Even after years of using Python, I am still impressed by its ability to express certain things lucidly and compact way.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com2tag:blogger.com,1999:blog-8342851416423405349.post-12149220972778676482011-01-07T17:03:00.000-08:002011-01-07T22:39:57.948-08:00Tuple packing and unpacking in PythonI was chatting to a mate who has recently started learning Python. He mentioned that one of the cool features that got him sold on Python is the fact that you can swap two values with the following idiom:<br /><br /><pre><br />x, y = y, x<br /></pre><br /><br />Yes, that is neat and quite predictable given that Python has automatic packing and unpacking of tuples. In other words, Python packs the values on the right hand side into a tuple and the assignment unpacks each value on the right hand side into the values on the left hand side.<br /><br />Extending this, we can do something like this:<br /><br /><pre><br />a = [3, 1, 4]<br />x, y, z = sorted(a)<br /></pre><br /><br />This will give us the lowest value in x and the highest value in z.<br /><br />Once again, this is a nice little hack. I was trying to think of an example where packing and unpacking really adds some value (in terms of readability or 'Pythonicness'). This morning, I accidentally came across the following<br /><a href="http://wiki.python.org/moin/SimplePrograms">example</a>:<br /><br />Suppose we want to generate a sequence of <a href="http://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a> (anyone who has done CS101 will recall that Fibonacci Numbers are the sequence of numbers 0, 1, 1, 2, 3, 5, 8, 13... where each number is created from the sum of the previous two numbers). <a href="http://en.wikipedia.org/wiki/Leonardo_of_Pisa">Fibonacci</a>, who went by many names, was apparently trying to calculate how many rabbits he would have if he started with one parent and each parent miraculously reproduced every 6 months having exactly one baby. After 12 months that baby would become a parent. Of course these miracle bunnies never die.<br /><br />The following program calculates the rabbit population until before it hits 1,000:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><br />parent, baby = 1, 0<br /><br />while baby < 1000:<br /> print baby<br /> parent, baby = (parent + baby, parent)<br /></pre><br /><br />I think this is a wonderful example of how one of Python's language features makes code really clear and how Python earns its reputation of 'executable pseudo-code'.<br /><br />Postscript: Another nice little way we can use this feature is as follows:<br /><pre><br />day, month, year = '31/12/2010'.split('/')<br /></pre>mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-8569639918615967602011-01-06T14:26:00.000-08:002011-02-19T13:18:22.998-08:00Winning money with PythonRecently someone posted the following on Reddit:<br /><br /><url>http://www.reddit.com/r/math/comments/erd1r/playing_card_mathematical_question_needs_your_help/</url><br /><br />Given a pack of playing cards, select two cards (at random) from the pack. Suppose these are a two and a ten. Replace the cards in the pack and shuffle. Now turn each card over in turn and look for one of the cards (eg. two or ten) followed by the other either as the next card or the card following the next card. You should have a better than 50% chance of that happening.<br /><br />There were a few posts trying to explain why this may happen and a few posts showing code (often in Python) that gave a simulation.<br /><br />The first code that I came across was the following:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><br />import random<br /><br />cards = ['Ace','King','Queen','Jack','Ten','Nine','Eight','Seven','Six','Five','Four','Three','Two']<br /><br />suits = ['Clubs','Spades','Hearts','Diamonds']<br /><br />deck = []<br /><br />for s in suits:<br /> for c in cards:<br /> deck.append(c + ' of ' + s)<br /><br />NUM_TRIALS = 100000<br /><br />wins = 0<br /><br />for trial in range(0,NUM_TRIALS):<br /> #pick two non_identical card values<br /> card_one = cards[random.randint(0,len(cards)-1)]<br /> identical = True<br /> while identical:<br /> card_two = cards[random.randint(0,len(cards)-1)]<br /> if card_one <> card_two:<br /> identical = False<br /><br />#shuffle the deck:<br />random.shuffle(deck)<br />#Separated by one?<br />sepaRated = False<br />i = 0<br />while i < len(deck) and not separated: <br /> if card_one[0:2] == deck[i][0:2]:<br /> if i+1 <= len(deck)-1:<br /> if card_two[0:2] == deck[i+1][0:2]:<br /> separated = True<br /><br /> if i+2 <= len(deck)-1:<br /> if card_two[0:2] == deck[i+2][0:2]:<br /> separated = True <br /><br /> if card_two[0:2] == deck[i][0:2]:<br /> if i+1 <= len(deck)-1:<br /> if card_one[0:2] == deck[i+1][0:2]:<br /> separated = True<br /><br /> if i+2 <= len(deck)-1:<br /> if card_one[0:2] == deck[i+2][0:2]:<br /> separated = True <br /><br /> i+=1<br /><br /> if separated:<br /> wins+=1<br /><br />print float(wins)/float(NUM_TRIALS)<br /></pre> The second solution was this one: <pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">sum(map(bool,[(lambda k: k &(k >> 1 | k >> 3 | (k & ((2 << 116) - 1) / 3) <br />>> 5))(int('0b00' + ''.join(__import__('random').sample(['01']*4+['10']<br />*4+['00']*44,52)),2)<<7) for i in range(2000000)]))/2000000.0 <br /></pre> This got me thinking about Python, readibility and 'Pythonicness'. The first solution is much closer to the problem domain - almost painfully so. The cards have been modeled as cards and the values are all strings. It took me a few seconds to parse the following statement in my head: <pre>if card_two[0:2] == deck[i+1][0:2]<br /></pre> And then realised that because we are using strings for the value, we need to compare the first two characters of each string. The author of the first program claims not to have much Python experience. The second program is painfully short. It would appear that the author has going to pains to fit it all onto one line. It may appear that the author has too much Python experience (although I realise s/he was just trying to make a point). Anyway, I thought it was a nice problem and so tried to come up with a Goldilocks solution... not too much experience, not too little experience. Not too much in the problem space and not too much in the solution space. I came up with the following solution: <pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><br />'''<br />Given a deck of cards, select two cards at random<br />and run a simulation to determine how often those<br />two cards appear within two cards of each other.<br />'''<br /><br />import random<br /><br />'''<br />Since we are not concerned with the suit, we shall <br />represent the deck as 4 lists of the cards from 1..13.<br />'''<br />cards = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] * 4<br /><br />trials = 1000 # number of simulations<br /><br />def check_sequence(deck, c1, c2):<br /> '''<br /> Given a deck of cards and two single cards, return True if the<br /> two cards appear within 2 cards of each other in the deck, else<br /> return False.<br /> '''<br /> for i in range(len(deck) - 2):<br /> if deck[i] == c1 and (deck[i+1] == c2 or deck[i+2] == c2):<br /> return True<br /><br /> if deck[i] == c2 and (deck[i+1] == c1 or deck[i+2] == c1):<br /> return True<br /><br /> return False<br /><br /><br />def run_simulation():<br /> '''<br /> Run a single event.<br /> Without loss of generality, we may assume that the two<br /> random cards we selected were 1 and 2.<br /> '''<br /> random.shuffle(cards)<br /> return 1 if check_sequence(cards, 1, 2) else 0<br /><br /><br />def main():<br /> successes = 0 # number of trials where we find the two cards within 1<br /><br /> for i in xrange(trials):<br /> successes += run_simulation()<br /><br /> print float(successes) / trials<br /><br />if __name__ == '__main__':<br /> main()<br /></pre> Running this version gave me similar results to the previous two versions (about 0.73 - or, in other words, 73% of the time, we would expect to find two random cards within 1 card of each other). I was pleasantly surprised to find that my code was the most efficient of the lot (i.e. ran in the shortest time). I suspect that this was because I used <pre>xrange() </pre>rather than <pre>range()</pre>.<br /><br />In order to try to monetarise my code, I offered a friend a $5.00 bet. I lost. I took another bet and lost again.<br /><br />I guess there is a moral there somewhere, but I am not exactly sure what it is.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-25112810198777269232010-12-13T23:45:00.000-08:002011-02-17T20:35:19.013-08:00Python - Batteries includedIt has often been said that Python comes with <a href="http://www.python.org/about/">batteries included</a>. This refers to the fact that it has a rich set of utilities included in the standard library. In the following program, the main program is a single line of code. It was made possible partly by the functions included in the random module. If we had made the specifications a bit tighter, we could have reduced the number of lines significantly. <br /><br />Here is what the program does: We would like to generate passwords (or some other random string), but we want to be able to specify a template. For example, we may want to specify that each password is 7 lowercase characters. In this case we would use the template 'aaaaaaa'. Suppose we wanted a password that looks like a Victorian (Victoria, Australia) number plate, we would specify the template 'AAA-999', in other words, three random letters followed by a hyphen followed by three random digits. Finally, we may want to generate a simple password that is easy to remember by making it pronouncable - so we generate it using the following template: 'Cvcvc' The 'c' will be replaced by a consonant and the 'v' by a vowel, giving us something like, maybe, 'Gamir'.<br /><br />Here is the program:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><br />import string<br />import random<br /><br />vowels_upper = 'AEIOU'<br />vowels_lower = 'aeiou'<br />consonants_upper = 'BCDFGHJKLMNPQRSTVWXYZ'<br />consonants_lower = 'bcdfghjklmnpqrstvwxyz'<br />alphanumeric = string.letters + string.digits<br />recognize = {<br />'A': string.ascii_uppercase, 'a': string.ascii_lowercase,<br />'C': consonants_upper, 'c': consonants_lower, <br />'V': vowels_upper, 'v': vowels_lower, <br />'d': string.digits, 'x': alphanumeric}<br /><br />def create_password(template):<br />return ''.join([random.choice(recognize[ch]) <br />if ch in recognize else ch for ch in template])<br /><br />if __name__ == '__main__':<br />for i in range(5):<br />print create_password('AAA-ddd')<br /><br />for i in range(5):<br />print create_password('Cvcvc')<br /><br /></pre><br /><br />When we run the above program, we get the following output:<br /><br /><pre>MOI-138<br />KMQ-882<br />UCH-536<br />LBK-734<br />MGA-798<br />Wewof<br />Roqug<br />Colud<br />Gokix<br />Berid<br /></pre><br /><br />That's quite a lot of power for what is essentially one line of code.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-79397993321473721442010-12-11T19:23:00.000-08:002011-02-19T13:21:16.506-08:00Phononyms in PythonIf you use a mobile phone to send SMSs, and you use predictive text, you will probably notice that sometimes you type a word, e.g. 'HOME' and the predictive text thinks you are trying to type 'GOOD' since both HOME and GOOD are represented by 4663.<br /><br />Recently a friend at work was suggesting a good name for these words that are equivalent on a mobile phone would be Nokianym. I think that I came up with phononym and found that other people on the internet had already thought of the same term.<br /><br />Anyway, for me, coming up with a Python program to find all such words in a dictionary was a more interesting problem. So, here's my Python script that, given a dictionary and the layout of a keyboard, will find all words that have an equivalent numerical representation.<br /><br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%">dictionary = '/usr/share/dict/words'<br /><br />nokia = {'a':2, 'b':2, 'c':2,<br /> 'd':3, 'e':3, 'f':3,<br /> 'g':4, 'h':4, 'i':4,<br /> 'j':5, 'k':5, 'l':5,<br /> 'm':6, 'n':6, 'o':6,<br /> 'p':7, 'q':7, 'r':7, 's':7,<br /> 't':8, 'u':8, 'v':8,<br /> 'w':9, 'x':9, 'y':9, 'z':9<br />}<br /><br />def get_val(word):<br /> result = 0<br /> word = word.lower()<br /> for c in word:<br /> if nokia.has_key(c):<br /> result = result * 10 + nokia[c]<br /> return result <br /><br />phononyms = {}<br />for word in open(dictionary):<br /> word = word.rstrip('\n\r')<br /> val = get_val(word)<br /> if (not phononyms.has_key(val)):<br /> phononyms[val] = [] <br /> phononyms[val].append(word)<br /><br />print [phononyms[val] for val in phononyms.keys() if len(phononyms[val]) > 1]<br /></pre>mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com3tag:blogger.com,1999:blog-8342851416423405349.post-32365919228106439752010-12-11T18:25:00.000-08:002010-12-11T18:58:06.702-08:00UNIX and Python - a marriage of convenienceRecently we were having a discussion about how many Friday 13ths there were in a particular year (or which year had the most, or something).<br /><br />I thought it should be easy to write a solution in Python, so I started with the following code:<br /><br /><pre style="font-family: Andale Mono, Lucida Console, Monaco, fixed, monospace; color: #000000; background-color: #eee;font-size: 12px;border: 1px dashed #999999;line-height: 14px;padding: 5px; overflow: auto; width: 100%"><code><br /><br />import datetime<br />from datetime import timedelta<br /><br />start = datetime.date(2009, 1, 1)<br />end = datetime.date(2013, 1, 1)<br /><br />while start < end:<br /> print start.strftime('%Y-%m-%d %A')<br /> start = start + timedelta(1)<br /><br /></code><br /></pre><br /><br />This little program prints out all the dates from 2009-01-01 to 2012-12-31 in the following format:<br /><br /><pre><br />2009-01-01 Thursday<br />2009-01-02 Friday<br />2009-01-03 Saturday<br />2009-01-04 Sunday<br />2009-01-05 Monday<br />...<br /></pre><br /><br />I could then have added some more code to determine if a date was a Friday 13th and printed it - but decided not to. I had a general purpose tool that printed a calendar, so all I needed to do was the following:<br /><br /><pre><br />python printcal | grep '13 Friday'<br /></pre><br /><br />Which then gives the following:<br /><br /><pre><br />2009-02-13 Friday<br />2009-03-13 Friday<br />2009-11-13 Friday<br />2010-08-13 Friday<br />2011-05-13 Friday<br />2012-01-13 Friday<br />2012-04-13 Friday<br />2012-07-13 Friday<br /></pre><br /><br />So, why did I prefer to do this rather than modify my Python script. Let's look at the Unix Philosophy<br /><br /><span style="font-style: italic;">This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.</span><br /><br /><a href="http://en.wikipedia.org/wiki/Unix_philosophy">Doug McIlroy</a><br /><br />A second part of the philosophy from Mike Gancarz is "<i>Look for the 90-percent solution."<br /><br /></i>So, I have created a small tool that prints all the dates in a period (this is the 90-percent solution). It is easy to verify that it is doing what I want to do. I use the output of this program as the input to the standard unix program <code>grep</code> that finds all the output that contains Friday 13. Suppose I now want to find all years that have at 3 months where the first of the month is a Monday, I can do it easily by modifying the <code>grep</code> part of the pipeline. If I want to see how many Monday 5ths we had in 2008, I can pipe the output of <code>grep</code> to <code>wc -l</code> which will print the number of lines output.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-72718690191609632642010-11-27T14:59:00.000-08:002010-11-27T15:07:12.795-08:00Saving the planet one light at a timeI have just written my latest game and posted it to my <a href="http://www.martinschweitzer.com/lights.html">site.</a><br /><br />This game is my latest in an attempt to teach myself and get practice in using HTML5, JavaScript and Canvas. It is not an original game and is described quite well <a href="http://en.wikipedia.org/wiki/Lights_Out_%28game%29">here</a>. The game is similar to the last one, so did not take much extra effort.<br /><br />As usual, the things that took the most effort were the small things. In particular getting the bevel correct on the lights and getting the radial gradients working correctly. The documentation that I found on radial gradient was not particularly helpful in helping me understand how it works. I have left all the source code unminified which will hopefully help anyone else trying to get radial gradients to work in JavaScript.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com0tag:blogger.com,1999:blog-8342851416423405349.post-55284709041871036112010-11-21T01:37:00.000-08:002010-11-21T01:58:40.937-08:00Finding words with PrimesA long time ago I wrote a simple game. The current version of the game is here: <a href="http://www.martinschweitzer.com/wordgame01.html">wordgame</a>. The current version was to give me a chance to play with canvas and Ajax.<br /><br />Anyway, I wrote the original version a long time ago - and what I needed to do was find all words in a dictionary that could be made from the 9-letter grid. The original version was written in C and at the time the only option was to work with arrays (no hashes, STL, sets etc to help).<br /><br />After a few attempts, I "discovered" the idea of using prime numbers. I have since noticed that this idea is not original. In fact it is similar to Godel numbering. Today I came across this blog entry <a href="http://paultyma.blogspot.com/2010/11/google-interviewing-story.html">google-interviewing-story</a>. One of the comments mentioned an idea that should be more efficient than using primes. The idea is to use two arrays. We will assume we have a 'secret' word and a dictionary full of words we want to search. Our first array contains the frequency of letters in our secret word and the second array is the frequency of the letters in the word we are checking. Here is the python code for the prime method:<br /><br /><pre><br />dictionary = '../scrabble.txt'<br />secret = 'chocolate'<br />primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31,<br /> 37, 41, 43, 47, 53, 59, 61, 67, 71, 73,<br /> 79, 83, 89, 97, 101]<br /><br />def prime_val(ch):<br /> return primes[ord(ch.lower()) - ord('a')]<br /><br />def get_val(word):<br /> total = 1<br /> for ch in word:<br /> total *= prime_val(ch)<br /> return total<br /><br />magic = get_val(secret)<br />for word in open(dictionary):<br /> word = word.rstrip('\n')<br /> if magic % get_val(word) == 0:<br /> print word<br /><br /></pre><br /><br />Here is the code using arrays:<br /><br /><pre><br />#!/usr/bin/python<br /><br />dictionary = '../scrabble.txt'<br />secret = 'chocolate'<br /><br />def xord(ch):<br /> return ord(ch) - ord('a')<br /><br />def get_word_array(word):<br /> word_array = [0] * 26<br /> for letter in word:<br /> word_array[xord(letter)] += 1<br /> return word_array<br /><br /><br />magic_array = get_word_array('chocolate')<br /><br />for word in open(dictionary):<br /> word = word.rstrip('\n')<br /> word_array = get_word_array(word)<br /> magic_clone = magic_array[:]<br /><br /> if min([magic_clone[i] - word_array[i] for i in range(26)]):<br /> print word<br /></pre><br /><br />So, which method was faster?<br /><br />Here are the timings (on my iMac)<br /><br />First program using primes:<br /><pre><br />real 0m1.391s<br />user 0m1.355s<br />sys 0m0.020s<br /></pre><br /><br />The more efficient program:<br /><pre><br />real 0m3.569s<br />user 0m3.509s<br />sys 0m0.024s<br /></pre><br /><br />So, the first program would appear to be noticeably quicker when implemented in Python. There was a good argument why the second program should be quicker.<br /><br />I guess that it comes down to the language used for the implementation and the specifics of the problem. If this story has a moral, it would probably be this: Never assume that one algorithm will be more efficient than another when applied to a specific problem with a specific language.mudeltasigmahttp://www.blogger.com/profile/17001618592069897953noreply@blogger.com1