Carsten Haese wrote: > On Tue, 2006-12-26 at 13:08 -0800, John Machin wrote: > > Wojciech Mula wrote: > > > Steve Bergman wrote: > > > > I'm looking for a module to do fuzzy comparison of strings. [...] > > > > > > Check module difflib, it returns difference between two sequences. > > > > and it's intended for comparing text files, and is relatively slow. > > > > Google "python levenshtein". You'll probably find this a better fit for > > typoed keys in a database. > > [...] > > Using the Levenshtein distance in combination with stripping "noise" > characters is a good start, but the OP might want to take it a step > further. One of the OP's requirements is to recognize visually similar > strings, but 241O (Letter O at the end) and 241X have the same > Levenshtein distance from 2410 (digit zero at the end) while the former > is visually much closer to 2410 than the latter. > > It seems to me that this could be achieved by starting with a standard > Levenshtein implementation such as http://hetland.org/python/distance.py > and altering the line "change = change + 1" to something like "change = > change + visual_distance(a[j-1], b[i-1])". visual_distance() would be a > function that embodies the OP's idea of which character replacements are > tolerable by returning a number between 0 (the two characters are > visually identical) and 1 (the two characters are completely different).
Ya ya ya, I could have told him a whole lot more -- please consider that what I did tell him was IMHO over-generous in response to an OT question asking for assistance with performing a Google search. ... and given his keys are described as "numbers", a better example might be 241O or 241o false-matching with 2416. ... and it might be a good idea if he ran the simplistic approach first and saw what near-misses he actually came up with before complicating it and slowing down what is already an O(N**2*L**2) exercise in the traditional/novice implementation where N is the number of keys and L is their average length. The OP needs to think about 123456789 compared with 123426789; are they the same account or not? What other information does he have? HTH, John -- http://mail.python.org/mailman/listinfo/python-list