Re: SequenceMatcher bug ?

Tim Roberts Thu, 11 Dec 2008 00:35:50 -0800

"Gabriel Genellina" <[EMAIL PROTECTED]> wrote:

>On Wed, 10 Dec 2008 15:14:20 -0200, eliben <[EMAIL PROTECTED]> wrote:
>
>> What ? This can't be.
>>
>> 1. Go to http://try-python.mired.org/
>> 2. Type
>> import difflib
>> 3. Type
>> difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()
>>
>> Don't you get 0 as the answer ?
>
>Ah, but that isn't the same expression you posted originally:
>
>SequenceMatcher(None, [4] + [10] * 500 + [5], [10] * 500 + [5]).ratio()
>
>Using *that* expression I got near 1.0 always. But leaving out the [5] at  
>the end, it's true, it gives the wrong answer.
>...
>I've updated the tracker item.


Your assessment that it is the same problem as #1528074 is correct.  It's
the "popularity" optimization.  The key here is that the second sequence
consists of 200 or more identical items.  For example, all of the
following return 0, although the first two pairs clearly share elements:

difflib.SequenceMatcher(None, [4] + [5] * 200, [5] * 200).ratio()
difflib.SequenceMatcher(None, [4] + [5]      , [5] * 200).ratio()
difflib.SequenceMatcher(None, [4]            , [5] * 200).ratio()

If you print get_matching_blocks(), you'll see that there are no real
matches (only the zero-length terminator), because the "b" sequence is
optimized completely away.  Tracker item #1528074 calls it "working as
designed" and suggests updating the docs.  However, I would argue that
it's worth checking for this.
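
For anyone who wants to reproduce this, here is a minimal sketch of the
check described above.  The variable names are mine.  The second half
assumes a Python version whose SequenceMatcher accepts the autojunk
keyword (3.2 and later, as far as I know); on older versions that call
raises TypeError and can simply be dropped.

```python
import difflib

a = [4] + [5] * 200
b = [5] * 200

# Default behaviour: 5 is "popular" in b, so it is dropped from the
# index that SequenceMatcher uses to look for matches.
default = difflib.SequenceMatcher(None, a, b)
default_blocks = default.get_matching_blocks()
default_ratio = default.ratio()
print(default_blocks)   # only the zero-length terminator block
print(default_ratio)

# Assumption: autojunk is available (Python 3.2+).  Turning it off
# disables the popularity heuristic for this comparison.
exact = difflib.SequenceMatcher(None, a, b, autojunk=False)
exact_ratio = exact.ratio()
print(exact_ratio)
```

With the heuristic disabled the matcher should find the run of 200
fives, so the ratio comes out close to 1.0 instead of 0.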
-- 
Tim Roberts, [EMAIL PROTECTED]
Providenza & Boekelheide, Inc.
--
http://mail.python.org/mailman/listinfo/python-list
