
[issue31561] difflib pathological behavior with mixed line endings

2017-09-23 Thread Mahmoud Al-Qudsi


Mahmoud Al-Qudsi added the comment:

Attaching file2

--
Added file: https://bugs.python.org/file47165/file2

___
Python tracker 
<https://bugs.python.org/issue31561>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue31561] difflib pathological behavior with mixed line endings

2017-09-23 Thread Mahmoud Al-Qudsi


New submission from Mahmoud Al-Qudsi:

While using the icdiff command line interface to difflib, I ran into an 
interesting issue where difflib took 47 seconds to compare two simple text 
documents (a PHP source code file that had been refactored via phptidy).

On subsequent analysis, it turned out to be some sort of pathological behavior 
triggered by the presence of mixed line endings. Normalizing the line endings 
in both files to \r\n via unix2dos and then comparing (making no other changes) 
resulted in the diff calculation completing in under 2 seconds.
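
For illustration only, the workaround above (normalizing line endings before comparing) can be sketched directly in Python rather than via unix2dos. The helper names `normalize_newlines` and `diff_ignoring_line_endings` and the sample strings are assumptions made for this sketch; they are not part of difflib or icdiff, and the attached files are not reproduced here.

```python
import difflib


def normalize_newlines(text):
    """Convert \\r\\n and bare \\r line endings to \\n (hypothetical helper)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def diff_ignoring_line_endings(a_text, b_text):
    """Return a unified diff computed after normalizing both inputs."""
    a_lines = normalize_newlines(a_text).splitlines(keepends=True)
    b_lines = normalize_newlines(b_text).splitlines(keepends=True)
    return list(difflib.unified_diff(a_lines, b_lines, "file1", "file2"))


# Stand-in data: the same content with \n and with \r\n line endings.
unix_text = "<?php\necho 'a';\necho 'b';\n"
dos_text = unix_text.replace("\n", "\r\n")

diff = diff_ignoring_line_endings(unix_text, dos_text)
print(diff)
```

With the line endings normalized, the two inputs are line-for-line identical, so difflib has nothing left to resolve. Note that this normalizes to \n rather than \r\n; either direction works as long as both inputs end up the same.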

I have attached the documents in question (file1 and file2) to this bug report.

--
components: Library (Lib)
files: file1
messages: 302788
nosy: Mahmoud Al-Qudsi
priority: normal
severity: normal
status: open
title: difflib pathological behavior with mixed line endings
versions: Python 3.6
Added file: https://bugs.python.org/file47164/file1


[issue31561] difflib pathological behavior with mixed line endings

2017-09-24 Thread Mahmoud Al-Qudsi


Mahmoud Al-Qudsi added the comment:

@tim.peters

No, `icdiff` is not part of core and probably should be omitted from the 
remainder of this discussion.

I just checked, and it's actually not a mix of line endings within each file; 
one file uses \n throughout and the other uses \r\n.
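
A minimal sketch of why this case is awkward at the line level, assuming the inputs are split with `keepends=True` so that the line endings take part in the comparison (the generated text below is stand-in data, not the attached files):

```python
import difflib

# Stand-in content: 50 distinct lines, once with \n and once with \r\n.
text = "".join("line %d of the file\n" % i for i in range(50))
lf_lines = text.splitlines(keepends=True)
crlf_lines = text.replace("\n", "\r\n").splitlines(keepends=True)

matcher = difflib.SequenceMatcher(None, lf_lines, crlf_lines)

# Count how many lines SequenceMatcher considers identical.
equal_lines = sum(block.size for block in matcher.get_matching_blocks())
opcodes = matcher.get_opcodes()

print(equal_lines)
print(opcodes)
```

Because every line differs in its final characters, no line compares equal and the whole file comes out as a single "replace" region, which leaves any finer-grained matching to whatever the caller does with that region.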

You can actually reproduce this bug by taking _any_ file and copying it, 
then executing `unix2dos file1; dos2unix file2` - you'll have two perfectly 
"correct" files that difflib will struggle to handle.

(as a preface to what follows, I've written a binary diff and incremental 
backup utility, so I'm familiar with the intricacies and pitfalls when it comes 
to diffing. I have not looked at difflib's source code, however. Looking at the 
documentation for difflib, it's not clear whether or not it should be 
considered a naive binary diffing utility, since it does seem to have the 
concept of "lines".)

Given that _both_ input files are "correct", with no line ending errors, I think 
the right optimization here would be for difflib to "realize" that two chunks 
are "identical" apart from their line endings (in other words, just plain 
different; I'm not asking for this to be treated as a special case). Instead of 
going on to search for a better match for either buffer, it should assume that 
none will be found later on and simply move on to the next block/chunk.
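
The idea can be approximated from outside difflib, without touching its internals. The sketch below is not difflib's behavior and not a proposed patch; it matches lines with their endings stripped and then reports the pairs that differ only in line ending. The names `strip_eol` and `eol_only_changes` are hypothetical.

```python
import difflib


def strip_eol(line):
    """Drop any trailing \\r and \\n characters from a line."""
    return line.rstrip("\r\n")


def eol_only_changes(a_lines, b_lines):
    """Return (a_index, b_index) pairs that match except for the line ending."""
    a_keys = [strip_eol(line) for line in a_lines]
    b_keys = [strip_eol(line) for line in b_lines]
    matcher = difflib.SequenceMatcher(None, a_keys, b_keys, autojunk=False)
    pairs = []
    for a_start, b_start, size in matcher.get_matching_blocks():
        for offset in range(size):
            i, j = a_start + offset, b_start + offset
            # The stripped content is equal; flag it if the raw lines are not.
            if a_lines[i] != b_lines[j]:
                pairs.append((i, j))
    return pairs


a = ["one\n", "two\n", "three\n"]
b = ["one\r\n", "two\n", "three\r\n"]
result = eol_only_changes(a, b)
print(result)
```

This pairs each line with its counterpart first and treats the line ending as a secondary difference, so no search for a "better" match elsewhere in the file is needed.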

Of course, in the event that file2 contains a line from file1 first with a 
different line ending and then again with the same line ending, difflib will 
not choose the correct line, but that's probably not something worth fretting 
over (like you said, mixed line endings == recipe for disaster).

Of course I can understand if all this is out of the scope of difflib and not 
an endeavor worth taking up.

--
