In article <[EMAIL PROTECTED]>, Irmen de Jong <[EMAIL PROTECTED]> wrote: >BBands wrote: >> I'd like to see if a string exists, even approximately, in another. For >> example if "black" exists in "blakbird" or if "beatles" exists in >> "beatlemania". The application is to look though a long list of songs >> and return any approximate matches along with a confidence factor. I >> have looked at edit distance, but that isn't a good choice for finding >> a short string in a longer one. I have also explored >> difflib.SequenceMatcher and .get_close_matches, but what I'd really >> like is something like: >> >> a = FindApprox("beatles", "beatlemania") >> print a >> 0.857 >> >> Any ideas? >> >> jab >> > >I collected a few pointers in this article: > >http://www.razorvine.net/frog/user/irmen/article/2005-05-28/53 > >It contains one or two additional methods besides the ones you mentioned. >Sorry, it's in Dutch, but you can find the links without much trouble i guess.
For those who don't read Dutch, here's a translation of that page: ======================== Python </frog/user/irmen/category/2> By fuzzy string comparision, I mean comparing two strings and getting a result that indicates how similar the strings are to each other. That is, it's not an absoulte string compaision (that only tells if the two strings are absolutely the same or not), but a more quantative string comparision. It can be used to, for example, to remove typos from commands, a specific form of spelling checker or for other natural language processing. Python has a few modules which offer a fuzzy string comparision. â QWERTY - or protein distances Recently, BoB Hooft posted an interesting module in comp.lang.python which uses an algorithm that is normally used for comparing proteins, but now works for ASCII strings. The result is a float which is negative if the strings are very different and increases up to len(string) + 1 if the strings are identical, the score can be viewd as 'the number of correspondingn letters'. The algorithm uses the layout of the QWERTY keyboard to specifiy the distance between the stings and therefore is restricted to ASCII strings (and only makes sense if one uses a QWERTY keyboard), but therefor is very well suited to compare an imput command with all the possible commands and thus to identify typos. Why not use a normalised score between 0.0 and 1.0? Rob: It doesn't matter if you only want the 'best score' from a number of strings. Negative results appear if the strings are not similar and also don't even have the same length. The module can be found here: : comparestrings.py <http://starship.python.net/crew/hooft/comparestrings.py>. *Python standard library* In de standard lib hebben we difflib met de functie get_close_matches <http://docs.python.org/lib/module-difflib.html#l2h-922>. Groot voordeel dat deze dus standaard bij Python zit. The standard library has difflib with the function get_close_matches <http://docs.python.org/lib/module-difflib.html#l2h-922>. The big advantage is that this is standard with Python. *Febrl (biometrisch)* Febrl <https://sourceforge.net/projects/febrl/> has a number of string comparision functions, including the 'edit distance' or 'Levenshtein distance' function. For the latter, there is an implementation on Useless Python <http://www.uselesspython.com/download.php?script_id=108>. *Apse* An extension module (written in C): Apse <http://www.personal.psu.edu/staff/i/u/iua1/python/apse/>. Requires python, SWIG, and a C compiler *Java* SecondString <http://sourceforge.net/projects/secondstring/> is an open source package with Java implementations of a large number of string comparision algorithms. They should not be too difficult to port to Python. -- Jim Segrave ([EMAIL PROTECTED])
-- http://mail.python.org/mailman/listinfo/python-list