On Sat, Feb 14, 2009 at 3:01 PM, Peter Otten <__pete...@web.de> wrote:
> Gabriel Genellina wrote:
>
> > On Fri, 13 Feb 2009 08:16:00 -0200, S.Selvam Siva
> > <s.selvams...@gmail.com> wrote:
> >
> >> I need some help.
> >> I tried to find the top n (e.g. 5) similar words for a given word, from
> >> a dictionary of 50,000 words.
> >> I used the python-levenshtein module, and sample code is as follows.
> >>
> >> def foo(searchword):
> >>     disdict = {}
> >>     for word in self.dictionary_words:
> >>         distance = Levenshtein.ratio(searchword, word)
> >>         disdict[word] = distance
> >>     # sort the disdict dictionary by values in descending order
> >>     similarwords = sorted(disdict, key=disdict.__getitem__, reverse=True)
> >>     return similarwords[:5]
> >
> > You may replace the last steps (sort + slice top 5) by heapq.nlargest -
> > at least you won't waste time sorting 49,995 irrelevant words...
> > Anyway, you should measure the time taken by the first part
> > (Levenshtein); it may be the most demanding. I think there is a C
> > extension for this, which should be much faster than pure Python
> > calculations.
>
> [I didn't see the original post]
>
> You can use the distance instead of the ratio and put the words into bins
> of the same length. Then if you find enough words with a distance <= 1 in
> the bin with the same length as the search word, you can stop looking.
>
> You might be able to generalize this approach to other properties that
> are fast to calculate and guarantee a minimum distance, e.g. set(word).
>
> Peter

Thank you all for your responses. [Sorry, I was away for a while.]

I used the functools and heapq modules, but that helped only a little. Then
I categorized the words by length and compared against a smaller set (each
set 50,000/4 = 12,500 words), so it now takes a quarter of the time of the
older method.

Further, can I use threads to achieve parallel comparison? I have little
knowledge of Python threading. Will the following code achieve parallelism?
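Gabriel's heapq.nlargest suggestion can be sketched as a small stand-alone example. This uses difflib.SequenceMatcher.ratio from the standard library as a stand-in for Levenshtein.ratio (the scores differ slightly, but the top-N selection is the same); the function name top_similar is illustrative, not from the original code:

```python
import difflib
import heapq

def top_similar(searchword, words, n=5):
    # SequenceMatcher.ratio is a stdlib stand-in for Levenshtein.ratio;
    # swap in Levenshtein.ratio if python-levenshtein is installed.
    def ratio(word):
        return difflib.SequenceMatcher(None, searchword, word).ratio()
    # nlargest keeps only the n best candidates instead of sorting
    # the whole 50,000-word dictionary.
    return heapq.nlargest(n, words, key=ratio)

words = ["apple", "ample", "maple", "apply", "angle", "tango"]
print(top_similar("apple", words, n=3))
```

heapq.nlargest is documented as equivalent to sorted(iterable, key=key, reverse=True)[:n], so this is a drop-in replacement for the sort-then-slice version, just cheaper when n is much smaller than the dictionary.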
thread1 = threading.Thread(target=self.findsimilar, args=("1", searchword, dic_word_set1))
thread2 = threading.Thread(target=self.findsimilar, args=("2", searchword, dic_word_set2))
thread3 = threading.Thread(target=self.findsimilar, args=("3", searchword, dic_word_set3))
thread1.start()
thread2.start()
thread3.start()
thread1.join()
thread2.join()
thread3.join()

I would like to hear suggestions.

Note: The issue is that I am developing a spell checker for my local
language. I may use more than 2.5 lakh (250,000) words, so I need the best
way to find an alternative word list.

--
Yours,
S.Selvam
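One caveat on the threading question: under CPython's GIL, threads running pure-Python comparison code will not execute in parallel on multiple cores, so the three threads above mostly just interleave. Processes sidestep the GIL. A hedged sketch of the same split-into-sets idea using a process pool (the names best_matches and parallel_search are illustrative, not from the original code, and difflib again stands in for Levenshtein):

```python
import difflib
import heapq
from concurrent.futures import ProcessPoolExecutor

def best_matches(args):
    # Illustrative worker: score one bin of words against the search word
    # and keep only that bin's top five candidates.
    searchword, words = args
    def ratio(word):
        return difflib.SequenceMatcher(None, searchword, word).ratio()
    return heapq.nlargest(5, words, key=ratio)

def parallel_search(searchword, word_bins):
    # One worker process per bin; merge the per-bin winners at the end.
    with ProcessPoolExecutor() as pool:
        parts = pool.map(best_matches, [(searchword, b) for b in word_bins])
    merged = [w for part in parts for w in part]
    return heapq.nlargest(
        5, merged,
        key=lambda w: difflib.SequenceMatcher(None, searchword, w).ratio())

if __name__ == "__main__":
    # The __main__ guard is required for multiprocessing on some platforms.
    bins = [["apple", "tango"], ["ample", "zebra"]]
    print(parallel_search("apple", bins))
```

Whether this actually beats the single-process version depends on how expensive each comparison is relative to the cost of sending the word bins to the workers; it is worth measuring both. If a C extension such as python-levenshtein dominates the runtime, profiling first, as Gabriel suggested, is the safer move.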
-- http://mail.python.org/mailman/listinfo/python-list