[issue2986] difflib.SequenceMatcher not matching long sequences

Eli Bendersky Tue, 06 Jul 2010 19:40:21 -0700

Eli Bendersky <[email protected]> added the comment:

Thanks!
Now let's see what the other devs say. The first response seems not to have
understood what you meant completely :-)


Eli

On Wed, Jul 7, 2010 at 01:18, Terry J. Reedy <[email protected]> wrote:

>
> Terry J. Reedy <[email protected]> added the comment:
>
> [Also posted to pydev for additional input, with Subject line
> Issue 2986: difflib.SequenceMatcher is partly broken
> Developed with input from Eli Bendersky, who will write patchfile(s) for
> whichever change option is chosen.]
>
> Summary: difflib.SeqeunceMatcher was developed, documented, and originally
> operated as "a flexible class for comparing pairs of sequences of any
> [hashable] type". An "experimental" heuristic was added in 2.3a1 to speed up
> its application to sequences of code lines, which are selected from an
> unbounded set of possibilities. As explained below, this heuristic partly to
> completely disables SequenceMatcher for realistic-length sequences from a
> small finite alphabet. The regression is easy to fix. The docs were never
> changed to reflect the effect of the heuristic, but should be, with whatever
> additional change is made.
>
> In the commit message for revision 26661, which added the heuristic, Tim
> Peters wrote "While I like what I've seen of the effects so far, I still
> consider this experimental.  Please give it a try!" Several people who have
> tried it discovered the problem with small alphabets and posted to the
> tracker. Issues #1528074, #1678339. #1678345, and #4622 are now-closed
> duplicates of #2986. The heuristic needs revision.
>
> Open questions (discussed after the examples): what exactly to do, which
> versions to do it too, and who will do it.
>
> ---
> Some minimal difference examples:
>
> from difflib import SequenceMatcher as SM
>
> # base example
> print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
> # should be and is 0.9975 (rounded)
>
> # make 'y' junk
> print(SM(lambda c:c=='y', 'x' + 'y'*199, 'y'*199).ratio())
> # should be and is 0.0
>
> # Increment b by 1 char
> print(SM(None, 'x' + 'y'*199, 'y'*200).ratio())
> # should be .995, but now is 0.0 because y is treated as junk
>
> # Reverse a and b, which increments b
> print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
> # should be .9975, as before, but now is 0.0 because y is junked
>
> The reason for the bug is the heuristic: if the second sequence is at least
> 200 items long then any item occurring more than one percent of the time in
> the second sequence is treated as junk. This was aimed at recurring code
> lines like 'else:' and 'return', but can be fatal for small alphabets where
> common items are necessary content.
>
> A more realistic example than the above is comparing DNA gene sequences.
> Without the heuristic SequenceMatcher.get_opcodes() reports an appropriate
> sequence of matches and edits and .ratio works as documented and expected.
>  For 1000/2000/6000 bases, the times on a old Athlon 2800 machine are
> <1/2/12 seconds. Since 6000 is longer than most genes, this is a realistic
> and practical use.
>
> With the heuristic, everything is junk and there is only one match, ''==''
> augmented by the initial prefix of matching bases. This is followed by one
> edit: replace the rest of the first sequence with the rest of the second
> sequence. A much faster way to find the first mismatch would be
>   i = 0
>   while first[i] == second[i]:
>      i+=1
> The match ratio, based on the initial matching prefix only, is spuriously
> low.
>
> ---
> Questions:
>
> 1: what change should be make.
>
> Proposed fix: Disentangle the heuristic from the calculation of the
> internal b2j dict that maps items to indexes in the second sequence b. Only
> apply the heuristic (or not) afterward.
>
> Version A: Modify the heuristic to only eliminate common items when there
> are more than, say, 100 items (when len(b2j)> 100 where b2j is first
> calculated without popularity deletions).
>
> The would leave DNA, protein, and printable ascii+[\n\r\t] sequences alone.
> On the other hand, realistic sequences of more than 200 code lines should
> have at least 100 different lines, and so the heuristic should continue to
> be applied when it (mostly?) 'should' be. This change leaves the API
> unchanged and does not require a user decision.
>
> Version B: add a parameter to .__init__ to make the heuristic optional. If
> the default were True ('use it'), then the code would run the same as now
> (even when bad). With the heuristic turned off, users would be able to get
> the .ratio they may expect and need. On the other hand, users would have to
> understand the heuristic to know when and when not to use it.
>
> Version C: A more radical alternative would be to make one or more of the
> tuning parameters user settable, with one setting turning it off.
>
> 2. What type of issue is this, and what version get changed.
>
> I see the proposal as partial reversion of a change that sometimes causes a
> regression, in order to fix the regression. Such would usually be called a
> bugfix. Other tracker reviewers claim this issue is a feature request, not a
> bugfix. Either way, 3.2 gets the fix. The practical issue is whether at
> least 2.7(.1) should get the fix, or whether the bug should forever continue
> in 2.x.
>
> 3. Who will make the change.
>
> Eli will write a patch and I will check it. However, Georg Brandel assigned
> the issue to Tim Peters, with a request for comment, but Tim never
> responded. Is there an active committer who will grab the issue and do a
> commit review when a patch is ready?
>
> ----------
>
> _______________________________________
> Python tracker <[email protected]>
> <http://bugs.python.org/issue2986>
> _______________________________________
>

----------
Added file: http://bugs.python.org/file17891/unnamed

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue2986>
_______________________________________

<div dir="ltr">Thanks! <br>Now let&#39;s see what the other devs say. The first 
response seems not to have understood what you meant completely 
:-)<br><br>Eli<br><br><br><div class="gmail_quote">On Wed, Jul 7, 2010 at 
01:18, Terry J. Reedy <span dir="ltr">&lt;<a 
href="mailto:[email protected]";>[email protected]</a>&gt;</span> 
wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 
204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im"><br>
Terry J. Reedy &lt;<a href="mailto:[email protected]";>[email protected]</a>&gt; 
added the comment:<br>
<br>
</div>[Also posted to pydev for additional input, with Subject line<br>
Issue 2986: difflib.SequenceMatcher is partly broken<br>
Developed with input from Eli Bendersky, who will write patchfile(s) for 
whichever change option is chosen.]<br>
<br>
Summary: difflib.SeqeunceMatcher was developed, documented, and originally 
operated as &quot;a flexible class for comparing pairs of sequences of any 
[hashable] type&quot;. An &quot;experimental&quot; heuristic was added in 2.3a1 
to speed up its application to sequences of code lines, which are selected from 
an unbounded set of possibilities. As explained below, this heuristic partly to 
completely disables SequenceMatcher for realistic-length sequences from a small 
finite alphabet. The regression is easy to fix. The docs were never changed to 
reflect the effect of the heuristic, but should be, with whatever additional 
change is made.<br>


<br>
In the commit message for revision 26661, which added the heuristic, Tim Peters 
wrote &quot;While I like what I&#39;ve seen of the effects so far, I still 
consider this experimental. Â Please give it a try!&quot; Several people who 
have tried it discovered the problem with small alphabets and posted to the 
tracker. Issues #1528074, #1678339. #1678345, and #4622 are now-closed 
duplicates of #2986. The heuristic needs revision.<br>


<br>
Open questions (discussed after the examples): what exactly to do, which 
versions to do it too, and who will do it.<br>
<br>
---<br>
Some minimal difference examples:<br>
<br>
from difflib import SequenceMatcher as SM<br>
<br>
# base example<br>
print(SM(None, &#39;x&#39; + &#39;y&#39;*199, &#39;y&#39;*199).ratio())<br>
# should be and is 0.9975 (rounded)<br>
<br>
# make &#39;y&#39; junk<br>
print(SM(lambda c:c==&#39;y&#39;, &#39;x&#39; + &#39;y&#39;*199, 
&#39;y&#39;*199).ratio())<br>
# should be and is 0.0<br>
<br>
# Increment b by 1 char<br>
print(SM(None, &#39;x&#39; + &#39;y&#39;*199, &#39;y&#39;*200).ratio())<br>
# should be .995, but now is 0.0 because y is treated as junk<br>
<br>
# Reverse a and b, which increments b<br>
print(SM(None, &#39;y&#39;*199, &#39;x&#39; + &#39;y&#39;*199).ratio())<br>
# should be .9975, as before, but now is 0.0 because y is junked<br>
<br>
The reason for the bug is the heuristic: if the second sequence is at least 200 
items long then any item occurring more than one percent of the time in the 
second sequence is treated as junk. This was aimed at recurring code lines like 
&#39;else:&#39; and &#39;return&#39;, but can be fatal for small alphabets 
where common items are necessary content.<br>


<br>
A more realistic example than the above is comparing DNA gene sequences. 
Without the heuristic SequenceMatcher.get_opcodes() reports an appropriate 
sequence of matches and edits and .ratio works as documented and expected. 
Â For 1000/2000/6000 bases, the times on a old Athlon 2800 machine are 
&lt;1/2/12 seconds. Since 6000 is longer than most genes, this is a realistic 
and practical use.<br>


<br>
With the heuristic, everything is junk and there is only one match, 
&#39;&#39;==&#39;&#39; augmented by the initial prefix of matching bases. This 
is followed by one edit: replace the rest of the first sequence with the rest 
of the second sequence. A much faster way to find the first mismatch would 
be<br>


 Â  i = 0<br>
 Â  while first[i] == second[i]:<br>
 Â  Â  Â i+=1<br>
The match ratio, based on the initial matching prefix only, is spuriously 
low.<br>
<br>
---<br>
Questions:<br>
<br>
1: what change should be make.<br>
<br>
Proposed fix: Disentangle the heuristic from the calculation of the internal 
b2j dict that maps items to indexes in the second sequence b. Only apply the 
heuristic (or not) afterward.<br>
<br>
Version A: Modify the heuristic to only eliminate common items when there are 
more than, say, 100 items (when len(b2j)&gt; 100 where b2j is first calculated 
without popularity deletions).<br>
<br>
The would leave DNA, protein, and printable ascii+[\n\r\t] sequences alone. On 
the other hand, realistic sequences of more than 200 code lines should have at 
least 100 different lines, and so the heuristic should continue to be applied 
when it (mostly?) &#39;should&#39; be. This change leaves the API unchanged and 
does not require a user decision.<br>


<br>
Version B: add a parameter to .__init__ to make the heuristic optional. If the 
default were True (&#39;use it&#39;), then the code would run the same as now 
(even when bad). With the heuristic turned off, users would be able to get the 
.ratio they may expect and need. On the other hand, users would have to 
understand the heuristic to know when and when not to use it.<br>


<br>
Version C: A more radical alternative would be to make one or more of the 
tuning parameters user settable, with one setting turning it off.<br>
<br>
2. What type of issue is this, and what version get changed.<br>
<br>
I see the proposal as partial reversion of a change that sometimes causes a 
regression, in order to fix the regression. Such would usually be called a 
bugfix. Other tracker reviewers claim this issue is a feature request, not a 
bugfix. Either way, 3.2 gets the fix. The practical issue is whether at least 
2.7(.1) should get the fix, or whether the bug should forever continue in 
2.x.<br>


<br>
3. Who will make the change.<br>
<br>
Eli will write a patch and I will check it. However, Georg Brandel assigned the 
issue to Tim Peters, with a request for comment, but Tim never responded. Is 
there an active committer who will grab the issue and do a commit review when a 
patch is ready?<br>


<div><div></div><div class="h5"><br>
----------<br>
<br>
_______________________________________<br>
Python tracker &lt;<a 
href="mailto:[email protected]";>[email protected]</a>&gt;<br>
&lt;<a href="http://bugs.python.org/issue2986"; 
target="_blank">http://bugs.python.org/issue2986</a>&gt;<br>
_______________________________________<br>
</div></div></blockquote></div><br></div>

_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue2986] difflib.SequenceMatcher not matching long sequences

Reply via email to