Gabriel Genellina wrote:
On Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva
<s.selvams...@gmail.com> wrote:
I need some help.
I tried to find the top n (e.g. 5) most similar words for a given word,
from a dictionary of 50,000 words.
I used the python-Levenshtein module, and sample code is as follows.

def foo(searchword):
    disdict = {}
    for word in self.dictionary_words:
        # Levenshtein.ratio is a similarity score in [0, 1]: higher = closer
        distance = Levenshtein.ratio(searchword, word)
        disdict[word] = distance
    # sort the words by score, in descending order
    similarwords = sorted(disdict, key=disdict.__getitem__, reverse=True)
    return similarwords[:5]
You may replace the last steps (sort + slice top 5) with heapq.nlargest -
at least you won't waste time sorting the 49,995 irrelevant words...

There is also no need to build the 50,000-entry word-distance dictionary:
import heapq, functools

def foo(searchword, n):
    # similarity to the search word (ratio: higher = more similar)
    distance = functools.partial(Levenshtein.ratio, searchword)
    # words is the iterable of dictionary words
    return heapq.nlargest(n, words, key=distance)
If the distances are wanted along with the similar words, I strongly
suspect that it would be faster to recalculate a small number than to
generate the dict of 50,000 pairs.
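A minimal sketch of that recalculation (foo_with_scores and the words
argument are illustrative names, not from the original post):

import heapq, functools
import Levenshtein

def foo_with_scores(searchword, words, n=5):
    similarity = functools.partial(Levenshtein.ratio, searchword)
    # find the n most similar words first...
    best = heapq.nlargest(n, words, key=similarity)
    # ...then recompute only n ratios instead of storing 50,000 pairs
    return [(word, similarity(word)) for word in best]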
Anyway, you should measure the time taken by the first part
(Levenshtein); it may be the most demanding. I think there is a C
extension for this, which should be much faster than pure Python
calculations.

And such could be dropped into the code above.
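As a rough check with the stdlib timeit module (a sketch; searchword
and words are placeholders, and foo is the nlargest version above):

import timeit

# time 10 full searches; nearly all of it is the 50,000 ratio() calls
elapsed = timeit.timeit(lambda: foo(searchword, 5), number=10)
print("%.3f s per search" % (elapsed / 10))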
Terry Jan Reedy