On Wed, 14 Nov 2007 14:56:25 -0300, <[EMAIL PROTECTED]> wrote:

>> I'm trying to write a program to test someone's typing speed and show
>> them their mistakes. However I'm getting weird results when looking
>> for the differences in longer (than 100 chars) strings:
>>
>> import difflib
>>
>> # a tape measure string (just makes it easier to locate a given index)
>> a =
>> '1-3-5-7-9-12-15-18-21-24-27-30-33-36-39-42-45-48-51-54-57-60-63-66-69
>> -72-75-78-81-84-87-90-93-96-99-103-107-111-115-119-123-127-131-135-139
>> -143-147-151-155-159-163-167-171-175-179-183-187-191-195--200'
>>
>> # now with a few mistakes
>> b = '1-3-5-7-
>> l-12-15-18-21-24-27-30-33-36-39o42-45-48-51-54-57-60-63-66-69-72-75-78
>> -81-84-8k-90-93-96-9l-103-107-111-115-119-12b-1v7-131-135-139-143-147-
>> 151-m55-159-163-167-a71-175j179-183-187-191-195--200'
>>
>> s = difflib.SequenceMatcher(None, a, b)
>> ms = s.get_matching_blocks()
>>
>> print ms
>>
>> >>> [(0, 0, 8), (200, 200, 0)]
>>
>> Have I made a mistake or is this function designed to give up when the
>> input strings get too long? If so what could I use instead to compute
>> the mistakes in a typed text?

Yes, there are some limitations on how SequenceMatcher works.

> ---------- Forwarded message ----------
> From: Evert Rol
> [...]
> And the part of the actual code reads:
>         if n >= 200 and len(indices) * 100 > n:   # <--- !!
>             populardict[elt] = 1
>             del indices[:]
>         else:
>             indices.append(i)
>
> So you're right: it has a stop at the (somewhat arbitrary) limit of
> 200 characters. [...] If you feel safe enough and on a fast platform, you
> can probably up
> that limit (or even put it somewhere as an optional variable in the
> code, which I would think is generally better).

If you try with a slightly shorter text (190 chars, for example) you get
the expected result, pretty fast:

py> s = difflib.SequenceMatcher(None, a[:190], b[:190])
py> ms = s.get_matching_blocks()
py> print ms
[(0, 0, 8), (9, 9, 30), (40, 40, 46), (87, 87, 11), (99, 99, 23),
 (123, 123, 2), (126, 126, 26), (153, 153, 15), (169, 169, 6),
 (176, 176, 14), (190, 190, 0)]

So it appears that your strings are hitting that (arbitrary) limit. Reading
the test quoted above: once the second string has 200 or more characters,
any character that makes up more than 1% of it is marked as "popular" and
its positions are dropped from the index used to find matches. From the
algorithm's point of view, your strings are a rather degenerate case: so
many '-', '0' and '1' characters to match. Try increasing that 200 to
something somewhat larger than your strings.

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list
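
A note for readers on newer Python versions: since Python 3.2 (and 2.7.1), SequenceMatcher accepts an autojunk argument that switches the popularity heuristic off, so editing the 200 in difflib.py should not be necessary. The following is only a minimal sketch, assuming Python 3. The strings are rebuilt to match the ones in the original post, and the names typos and mismatched_positions are illustrative helpers, not part of difflib or of the poster's program.

```python
import difflib

# Rebuild the 200-character "tape measure" string from the original post.
a = ("1-3-5-7-9-"
     + "".join(f"{n}-" for n in range(12, 100, 3))
     + "".join(f"{n}-" for n in range(103, 196, 4))
     + "-200")

# The typos from the post, keyed by index (hypothetical helper table).
typos = {8: "l", 39: "o", 86: "k", 98: "l", 122: "b",
         125: "v", 152: "m", 168: "a", 175: "j"}
b = "".join(typos.get(i, ch) for i, ch in enumerate(a))


def mismatched_positions(a, b, autojunk):
    """Return the indices of `a` that no matching block covers."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=autojunk)
    covered = set()
    for block in matcher.get_matching_blocks():
        covered.update(range(block.a, block.a + block.size))
    return [i for i in range(len(a)) if i not in covered]


# autojunk=True is the default and reproduces the behaviour discussed above.
with_heuristic = mismatched_positions(a, b, autojunk=True)
# autojunk=False disables the "popular element" shortcut entirely.
without_heuristic = mismatched_positions(a, b, autojunk=False)

print(len(with_heuristic), "positions reported with the heuristic on")
print("positions reported with the heuristic off:", without_heuristic)
```

With the heuristic off, the positions reported should be just the typo indices. The trade-off is speed: the heuristic exists to keep long, repetitive inputs fast, so for much longer texts it may be worth comparing word by word or line by line instead of character by character.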