Double Metaphone in Python and MySQL
I'm tinkering with a project that involves reading data from MP3 files or iTunes playlists and retrieving info from various web services. I was having trouble with misspelled or incomplete names, so I started doing some research. Wikipedia has a short article on phonetic algorithms that includes a mention of the Double Metaphone algorithm. I had been using Soundex, which is a simple algorithm and is built into MySQL as the 'SOUNDS LIKE' SQL operator, but it was matching too many things for my purpose.
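To give a feel for why Soundex over-matches, here is a minimal sketch of the classic four-character Soundex in Python. The function name soundex is my own, and MySQL's built-in version differs in some details, so treat this as an illustration rather than a replica of what 'SOUNDS LIKE' does.

```python
def soundex(name):
    """Classic four-character Soundex (illustrative, not MySQL's variant)."""
    codes = {}
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    for letters, digit in groups:
        for ch in letters:
            codes[ch] = digit

    # Keep letters only, upper-cased.
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""

    result = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        # Emit a digit only when it differs from the previous code.
        if code and code != prev:
            result += code
        # H and W do not separate letters that share a code; vowels do.
        if ch not in "HW":
            prev = code
    # Pad with zeros and truncate to one letter plus three digits.
    return (result + "000")[:4]


print(soundex("Robert"))  # R163
print(soundex("Rupert"))  # R163
```

Two fairly different names collapse to the same code because only the first letter and three consonant classes survive.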
Double Metaphone is a much more complicated algorithm, but it's proving to be worth the trouble so far. I found a listing in C here that I ultimately used as a source. There is also a listing in Ruby that I tried starting from, but the comments weren't as thorough and it makes what I thought was slightly gratuitous use of regular expressions.
I wanted a version in Python, because I'm already using Python to work with my data in MySQL and Plone/Zope is built on Python. Besides, Python really is just fun! The item in ['list', 'of', 'items'] construct with the in operator and Python's built-in string indexing and slicing made the code pretty easy to write and read.
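As a rough illustration of that style, here is the kind of substring test a translation like this leans on. The helper name string_at and the sample word are hypothetical, made up for this sketch, and are not taken from the C or Ruby listings.

```python
def string_at(word, start, length, candidates):
    """Return True if word[start:start + length] is one of the candidates."""
    # A negative start would wrap around in Python, so reject it explicitly.
    if start < 0:
        return False
    # Slicing past the end of a string is safe; it just yields a shorter string.
    return word[start:start + length] in candidates


word = "SCHMIDT"
print(string_at(word, 0, 3, ["SCH"]))        # True
print(string_at(word, 5, 2, ["DT", "DD"]))   # True
print(string_at(word, 6, 2, ["DT", "DD"]))   # False
```

Slicing plus the in operator replaces the bounds checks and string comparisons that a C version has to spell out by hand.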
I may have to make some tweaks to make it suit my purpose, and I might be able to optimize it a little by making it more 'Pythonic' rather than such a straight translation of the C. When I first started looking at it I was sure I'd be using a bunch of regular expressions, but it turned out to be cleaner to not use them. If I do change things much, I might be able to replace lots of lines of code with a regex.
After working with it for a while, I have found that I can get 90% of the benefit of Double Metaphone using this regular expression substitution: re.compile(r'[\Waeiouy]', re.IGNORECASE).sub('', your_string).upper(). This gets rid of non-word characters such as spaces and punctuation ('\W'; digits and underscores are kept) along with vowels, and converts the result to all upper case. Your mileage may vary; my data was kind of all over the place. If yours is already pretty clean and you want the last 10%, Double Metaphone is worth looking at.
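Wrapped in a function, the substitution looks something like this. The name consonant_key and the sample strings are my own choices for illustration.

```python
import re

# Non-word characters plus vowels (treating 'y' as a vowel), case-insensitive.
_STRIP = re.compile(r'[\Waeiouy]', re.IGNORECASE)


def consonant_key(text):
    """Drop non-word characters and vowels, then upper-case what is left."""
    return _STRIP.sub('', text).upper()


print(consonant_key("The Beatles"))    # THBTLS
print(consonant_key("The Beetles"))    # THBTLS
print(consonant_key("Guns N' Roses"))  # GNSNRSS
print(consonant_key("guns n roses"))   # GNSNRSS
```

Doubled consonants are not collapsed, so a key like this will still miss a misspelling that drops a repeated letter.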
Something you might want to examine is that the algorithm throws away every vowel except one in the first position, which is mapped to 'A' no matter which vowel it is. My data suggests that it might be worth keeping final vowels in certain words. For example, 'read' and 'ready' both map to 'RD'. Accidental 'y's have got to be pretty rare, so I think the final 'y' should be kept, giving 'RDY'.
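Here is a sketch of the keep-the-final-'y' idea. To keep it short it uses a simple strip-the-vowels key as a stand-in rather than real Double Metaphone, and the function name key_keep_final_y is hypothetical.

```python
import re

# Stand-in key: strip non-word characters and vowels (NOT Double Metaphone).
_STRIP = re.compile(r'[\Waeiouy]', re.IGNORECASE)


def key_keep_final_y(word):
    """Consonant key for a single word that keeps a trailing 'y'."""
    word = word.strip()
    key = _STRIP.sub('', word).upper()
    # Re-append the 'Y' only when the word really ends in 'y'.
    if len(word) > 1 and word[-1].lower() == 'y':
        key += 'Y'
    return key


print(key_keep_final_y("read"))   # RD
print(key_keep_final_y("ready"))  # RDY
```

The same tweak would have to be made inside a real Double Metaphone implementation wherever it handles the last character of the word.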
I wanted to be able to use Double Metaphone in an SQL statement, so I took a swipe at translating it into a function for MySQL. This is where I learned to really appreciate Python! The syntax of IF...ELSEIF...END IF is so ugly after getting used to relying on indentation in Python. I eventually got it working, and it's handy to be able to SELECT Name FROM tblPeople WHERE dm(Name) = dm(@search). I would recommend pre-computing the Double Metaphone values for a database of any size (UPDATE tblPeople SET NameDM = dm(Name)) and comparing against the stored NameDM column, so the function runs once per search instead of once per row. I don't know quite why, but I'm finding this function to be a wee bit sluggish on my version of MySQL (5.0.22).
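The pre-computing pattern can be tried out without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, and its dm function is a simple strip-the-vowels key, not real Double Metaphone. The index name and sample rows are made up; only tblPeople, Name, NameDM, and dm come from the statements above.

```python
import re
import sqlite3

_STRIP = re.compile(r'[\Waeiouy]', re.IGNORECASE)


def dm(text):
    """Stand-in phonetic key (NOT real Double Metaphone)."""
    if text is None:
        return None
    return _STRIP.sub('', text).upper()


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name dm, taking one argument.
conn.create_function("dm", 1, dm)

conn.execute("CREATE TABLE tblPeople (Name TEXT, NameDM TEXT)")
conn.executemany(
    "INSERT INTO tblPeople (Name) VALUES (?)",
    [("The Beatles",), ("Led Zeppelin",), ("The Beetles",)],
)

# Pre-compute the key once per row, then index it.
conn.execute("UPDATE tblPeople SET NameDM = dm(Name)")
conn.execute("CREATE INDEX idxNameDM ON tblPeople (NameDM)")


def find(search):
    """Look up names whose stored key matches the key of the search text."""
    rows = conn.execute(
        "SELECT Name FROM tblPeople WHERE NameDM = dm(?) ORDER BY Name",
        (search,),
    )
    return [row[0] for row in rows]


print(find("the beatles"))  # ['The Beatles', 'The Beetles']
```

With the stored column, the key function is evaluated once for the search term and the comparison can use the index, rather than calling dm on every row of the table.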