On Tue, 2006-12-26 at 13:08 -0800, John Machin wrote:
> Wojciech Mula wrote:
> > Steve Bergman wrote:
> > > I'm looking for a module to do fuzzy comparison of strings. [...]
> >
> > Check module difflib, it returns difference between two sequences.
>
> and it's intended for comparing text files, and is relatively slow.
>
> Google "python levenshtein". You'll probably find this a better fit for
> typoed keys in a database.
> [...]
Using the Levenshtein distance in combination with stripping "noise"
characters is a good start, but the OP might want to take it a step
further. One of the OP's requirements is to recognize visually similar
strings, but 241O (letter O at the end) and 241X have the same
Levenshtein distance from 2410 (digit zero at the end), even though the
former is visually much closer to 2410 than the latter.

It seems to me that this could be achieved by starting with a standard
Levenshtein implementation such as
http://hetland.org/python/distance.py and altering the line
"change = change + 1" to something like
"change = change + visual_distance(a[j-1], b[i-1])". visual_distance()
would be a function that embodies the OP's idea of which character
replacements are tolerable, returning a number between 0 (the two
characters are visually identical) and 1 (the two characters are
completely different).

Hope this helps,

-Carsten
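
P.S.: To make the idea concrete, here is a rough, untested sketch along
the lines of the distance.py linked above. The weighted_distance()
function, the confusion groups, and the 0.2 cost are only illustrations;
the OP would substitute his own notion of which characters look alike.

# Groups of characters that are easy to mistake for one another.  These
# groups and the 0.2 cost below are purely illustrative; tune them to
# whatever the OP considers "visually similar".
_CONFUSION_GROUPS = [set("0OoQ"), set("1lI|"), set("5S"), set("8B"), set("2Z")]

def visual_distance(c1, c2):
    """Return 0.0 for identical characters, a small cost for visually
    similar characters, and 1.0 for completely different ones."""
    if c1 == c2:
        return 0.0
    for group in _CONFUSION_GROUPS:
        if c1 in group and c2 in group:
            return 0.2
    return 1.0

def weighted_distance(a, b):
    """Levenshtein distance where a substitution costs visual_distance()
    instead of a flat 1; insertions and deletions still cost 1."""
    n, m = len(a), len(b)
    if n > m:
        # Keep the shorter string in 'a' so each row stays small.
        a, b = b, a
        n, m = m, n
    current = list(range(n + 1))
    for i in range(1, m + 1):
        previous, current = current, [i] + [0] * n
        for j in range(1, n + 1):
            add, delete = previous[j] + 1, current[j - 1] + 1
            change = previous[j - 1]
            if a[j - 1] != b[i - 1]:
                change = change + visual_distance(a[j - 1], b[i - 1])
            current[j] = min(add, delete, change)
    return current[n]

if __name__ == "__main__":
    print(weighted_distance("2410", "241O"))  # 0.2: O is easily mistaken for 0
    print(weighted_distance("2410", "241X"))  # 1.0: X looks nothing like 0

With weights like these, "241O" scores much closer to "2410" than "241X"
does, which is the kind of ranking the OP seems to be after.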