harryos wrote: > I am trying to write a compare method which takes two strings and find > how many characters have changed. > > > def compare_strings(s1,s2): > pass > > > text1="goat milk" > text2="cow milk" > print compare_strings(text1,text2) > > This must give 3 ,since 3 characters are changed between strings.I was > advised to use levenshtein algorithm ..but then the matrix ops take a > long time for nontrivial strings of say 20000 characters ..Can this
I tried it with a string of 20000 chars and the python-levenshtein package that comes with ubuntu. It took about one second to calculate the distance: import functools import random import string import time from Levenshtein import distance def make_strings(n, delete, insert, swap, replace, charset=string.ascii_letters): def gen_chars(): while True: yield random.choice(charset) chars = gen_chars() a = [next(chars) for i in xrange(n)] s = "".join(a) for i in range(delete): del a[random.randrange(len(a))] for i in range(insert): a.insert(random.randrange(len(a)+1), next(chars)) for i in range(swap): p = random.randrange(len(a)-1) a[p], a[p+1] = a[p+1], a[p] for i in range(replace): a[random.randrange(len(a))] = next(chars) t = "".join(a) return s, t N = 20000 M = 100 ms = functools.partial(make_strings, N, M, M, M, M) def measure(f, *args): start = time.time() try: return f(*args) finally: print time.time() - start if __name__ == "__main__": import sys args = sys.argv[1:] if args: N, M = map(int, args) s, t = make_strings(N, M, M, M, M) print measure(distance, s, t) $ python levenshtein_demo.py 10000 1000 0.225363969803 3644 $ python levenshtein_demo.py 20000 1000 1.05217313766 4197 $ python levenshtein_demo.py 30000 1000 2.38736391068 4390 $ python levenshtein_demo.py 40000 1000 4.1686527729 4558 What would be an acceptable time? Peter -- http://mail.python.org/mailman/listinfo/python-list