<[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > ***Check out difflib, it's in the library.*** Perfect package for what > the OP wants AFAICT.
The method in difflib is okay, but doesn't do that good a job. It is also relatively slow. My need for this was matching records in BitPim (eg which entries just read from a cell phone correspond to which entries just read from Outlook) Here are some links for the OP as well as ready to run code at the end. A paper on various algorithms and their effectiveness on various data sets. http://www.isi.edu/info-agents/workshops/ijcai03/papers/Cohen-p.pdf They do have a Java library of all the algorithms by the same authors. http://secondstring.sourceforge.net/ There is a group at the Australian National University with a project named Febrl - Freely extensible biomedical record linkage. The idea is to be able to work out if two records from the same or different sources are the same person. All their code is in Python: http://datamining.anu.edu.au/software/febrl/febrldoc/ The solution I settled on is using Jaro-Winkler. This is a quote from the paper above: A surprisingly good distance metric is a fast heuristic scheme, proposed by Jaro (Jaro 1995; 1989) and later extended by Winkler (Winkler 1999). This works almost as well as the Monge-Elkan scheme, but is an order of magnitude faster I did find a Python module that implemented this (it is part of the python-levenshtein package). Unfortunately there were some issues at the time dealing with some strings that may be unicode and some not. There was one implementation I saw that took copies of both strings and was replacing characters with asterisks. My data sets include meaningful asterisks so that seriously munged things. Ultimately I wrote my own implementation with the following properties: - The arguments can be any combination of unicode and ordinary strings - There are no special characters nor ones used internally as replacements - A bitset is used for tracking so you use an eigth of the memory - There is both Python and C code implementations so you don't have to compile anything but can if you want The code can be grabbed from: http://cvs.sf.net/viewcvs.py/bitpim/bitpim/native/strings/ There is a detailed description at the begining of the Python implementation: http://cvs.sf.net/viewcvs.py/bitpim/bitpim/native/strings/jarowpy.py?view=markup Roger -- http://mail.python.org/mailman/listinfo/python-list