[issue7090] encoding unicode objects greater than FFFF

2009-10-09 Thread Mahmoud

New submission from Mahmoud :

Odd behaviour with str.encode, codecs.Codec.encode, or similar
functions when dealing with unicode objects above FFFF.

with 2.6
>>> u'\u10380'.encode('utf')
'\xe1\x80\xb80'

with 3.x
>>> '\u10380'.encode('utf')
b'\xe1\x80\xb80'

correct output must be:
\xf0\x90\x8e\x80
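
For reference, a likely explanation that the report itself does not state: the
`\u` escape reads exactly four hex digits, so '\u10380' is the two-character
string U+1038 followed by the digit '0'. Code points above FFFF need the
eight-digit `\U` escape. A minimal Python 3 sketch (variable names are
illustrative only):

```python
# '\u' reads exactly four hex digits: this is U+1038 followed by the digit '0'.
four_digit = '\u10380'
# '\U' reads exactly eight hex digits and can name code points above U+FFFF.
eight_digit = '\U00010380'

four_digit_utf8 = four_digit.encode('utf-8')
eight_digit_utf8 = eight_digit.encode('utf-8')

# Print the string lengths next to the encoded bytes for comparison.
print(len(four_digit), four_digit_utf8)
print(len(eight_digit), eight_digit_utf8)
```

With the eight-digit escape, the encoded bytes match the output the report
expects.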

--
components: Unicode
messages: 93780
nosy: msaghaei
severity: normal
status: open
title: encoding unicode objects greater than FFFF
type: behavior
versions: Python 2.6, Python 2.7, Python 3.0, Python 3.1

___
Python tracker 
<http://bugs.python.org/issue7090>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue6396] No conversion specifier in the string, no __getitem__ method in the right hand value

2009-07-01 Thread Mahmoud

New submission from Mahmoud :

When a class instance is used as a mapping for the right-hand value in a
string format expression with no conversion specifier, it seems logical
to require that the class has a __getitem__ method. Therefore, the following
format expression should raise an exception.

>>> class AClass(object):
...   pass
... 
>>> c = AClass()
>>> "a string with no conversion specifier" % c
'a string with no conversion specifier'

--
messages: 89987
nosy: msaghaei
severity: normal
status: open
title: No conversion specifier in the string, no __getitem__ method in the 
right hand value
versions: Python 2.6, Python 2.7, Python 3.0, Python 3.1

___
Python tracker 
<http://bugs.python.org/issue6396>
___



[issue10970] "string".encode('base64') is not the same as base64.b64encode("string")

2011-01-20 Thread Mahmoud Abdelkader

New submission from Mahmoud Abdelkader :

Given a string, encoding it with .encode('base64') is not the same as using 
base64's b64encode function. I think this is very unclear and unintuitive. 

Here's some example code to demonstrate the problem. Before I attempt to submit 
a patch, is this done for legacy reasons? Are there any reasons to use one over 
the other?

import hmac
import hashlib
import base64


signature = hmac.new('secret', 'url', hashlib.sha512).digest()
assert signature.encode('base64') == base64.b64encode(signature)

--
components: Library (Lib)
messages: 126696
nosy: mahmoudimus
priority: normal
severity: normal
status: open
title: "string".encode('base64') is not the same as base64.b64encode("string")
versions: Python 2.5, Python 2.6, Python 2.7

___
Python tracker 
<http://bugs.python.org/issue10970>
___



[issue10970] "string".encode('base64') is not the same as base64.b64encode("string")

2011-01-21 Thread Mahmoud Abdelkader

Mahmoud Abdelkader  added the comment:

Thanks for the clarification, Terry. This is indeed not a bug. For reference,
the pieces of code I pasted line-wrapped after the 76th character, which was my
main source of confusion.

After reading RFC 3548, I now understand that the behavior of string.encode is
the correct and expected result, as the documentation (section 7.8.3) states
that it's MIME base64.
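
To illustrate the difference discussed above, here is a small Python 3 sketch.
It assumes that `base64.encodebytes` is the closest Python 3 counterpart of the
Python 2 `.encode('base64')` call; it produces MIME-style output wrapped at 76
characters, while `base64.b64encode` returns a single unwrapped line.

```python
import base64
import hashlib
import hmac

# A SHA-512 HMAC digest is 64 bytes, long enough to trigger MIME line wrapping.
signature = hmac.new(b'secret', b'url', hashlib.sha512).digest()

plain = base64.b64encode(signature)    # one line, no trailing newline
mime = base64.encodebytes(signature)   # wrapped at 76 characters, newline-terminated

first_line = mime.split(b'\n')[0]
print(len(first_line))
print(mime.replace(b'\n', b'') == plain)
```

Stripping the newlines from the MIME form yields the same bytes as
`b64encode`, so the two differ only in line wrapping.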

--

___
Python tracker 
<http://bugs.python.org/issue10970>
___



[issue44121] Missing implementation for formatHeader and formatFooter methods of the BufferingFormatter class in the logging module.

2021-05-13 Thread Mahmoud Harmouch


New submission from Mahmoud Harmouch :

While browsing the source code of the logging package, I encountered
missing implementations for the formatHeader and formatFooter methods
of the BufferingFormatter class (in the __init__ file). I'm going to
implement them and push the changes in a pull request.
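
For context, a minimal sketch of how these two methods can be overridden in a
subclass. In current CPython the base-class versions return an empty string.
`BannerFormatter` and the record contents are hypothetical names chosen for
illustration; this is not the change proposed in the pull request.

```python
import logging


class BannerFormatter(logging.BufferingFormatter):
    """Hypothetical subclass that wraps a batch of records in a banner."""

    def formatHeader(self, records):
        return "=== %d record(s) ===\n" % len(records)

    def formatFooter(self, records):
        return "\n=== end ==="


# Build two records by hand so the example does not depend on logger setup.
records = [
    logging.LogRecord("demo", logging.INFO, "demo.py", 1, "message %d", (i,), None)
    for i in range(2)
]

output = BannerFormatter().format(records)
print(output)
```

Note that `format()` concatenates the formatted records without adding
separators, so any line breaks have to come from the line format or from the
header and footer.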

--
components: Library (Lib)
messages: 393565
nosy: Harmouch101
priority: normal
severity: normal
status: open
title: Missing implementation for formatHeader and formatFooter methods of the 
BufferingFormatter class in the logging module.
type: enhancement
versions: Python 3.11

___
Python tracker 
<https://bugs.python.org/issue44121>
___



[issue44121] Missing implementation for formatHeader and formatFooter methods of the BufferingFormatter class in the logging module.

2021-05-13 Thread Mahmoud Harmouch


Change by Mahmoud Harmouch :


--
keywords: +patch
pull_requests: +24735
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/26095

___
Python tracker 
<https://bugs.python.org/issue44121>
___



[issue12949] Documentation of PyCode_New() lacks kwonlyargcount argument

2012-01-14 Thread Mahmoud Hashemi

Changes by Mahmoud Hashemi :


--
nosy: +mahmoud

___
Python tracker 
<http://bugs.python.org/issue12949>
___



[issue13787] PyCode_New not round-trippable (TypeError)

2012-01-14 Thread Mahmoud Hashemi

New submission from Mahmoud Hashemi :

On Python 3.1.4, attempting to create a code object will apparently result in a 
TypeError (must be str, not tuple), even when you're creating a code object 
from another, working code object:

# co_what.py
def foo():
    return 'bar'

co = foo.__code__

co_copy = type(co)(co.co_argcount,
                   co.co_kwonlyargcount,
                   co.co_nlocals,
                   co.co_stacksize,
                   co.co_flags,
                   co.co_code,
                   co.co_consts,
                   co.co_names,
                   co.co_varnames,
                   co.co_freevars,
                   co.co_cellvars,
                   co.co_filename,
                   co.co_name,
                   co.co_firstlineno,
                   co.co_lnotab)
# EOF
$ python3 co_what.py
Traceback (most recent call last):
  File "co_what.py", line 20, in <module>
    co.co_lnotab)
TypeError: must be str, not tuple

Looking at the PyCode_New function, all the arguments look correctly matched up 
according to the signature in my Python 3.1.4 build source (looks identical to 
the trunk source):

# Objects/codeobject.c

PyCode_New(int argcount, int kwonlyargcount,
           int nlocals, int stacksize, int flags,
           PyObject *code, PyObject *consts, PyObject *names,
           PyObject *varnames, PyObject *freevars, PyObject *cellvars,
           PyObject *filename, PyObject *name, int firstlineno,
           PyObject *lnotab)
{
    PyCodeObject *co;
    Py_ssize_t i;

    /* Check argument types */
    if (argcount < 0 || nlocals < 0 ||
        code == NULL ||
        consts == NULL || !PyTuple_Check(consts) ||
        names == NULL || !PyTuple_Check(names) ||
        varnames == NULL || !PyTuple_Check(varnames) ||
        freevars == NULL || !PyTuple_Check(freevars) ||
        cellvars == NULL || !PyTuple_Check(cellvars) ||
        name == NULL || !PyUnicode_Check(name) ||
        filename == NULL || !PyUnicode_Check(filename) ||
        lnotab == NULL || !PyBytes_Check(lnotab) ||
        !PyObject_CheckReadBuffer(code)) {
        PyErr_BadInternalCall();
        return NULL;
    }

And, for the record, this same behavior works just fine in the equivalent 
Python 2.

--
components: Interpreter Core
files: co_what.py
messages: 151270
nosy: mahmoud
priority: normal
severity: normal
status: open
title: PyCode_New not round-trippable (TypeError)
type: behavior
versions: Python 3.1, Python 3.2
Added file: http://bugs.python.org/file24239/co_what.py

___
Python tracker 
<http://bugs.python.org/issue13787>
___



[issue13787] PyCode_New not round-trippable (TypeError)

2012-01-14 Thread Mahmoud Hashemi

Mahmoud Hashemi  added the comment:

And here's the working Python 2 version (works fine on Python 2.7, and likely a 
few versions prior).

--
Added file: http://bugs.python.org/file24240/co_what2.py

___
Python tracker 
<http://bugs.python.org/issue13787>
___



[issue13787] PyCode_New not round-trippable (TypeError)

2012-01-17 Thread Mahmoud Hashemi

Mahmoud Hashemi  added the comment:

Yes, I knew it was an issue with crossed wires somewhere. The Python 2 code
doesn't translate well to Python 3 because the function signature changed to
add kwonlyargcount. And I guess the argument order is substantially different,
too, as described in Objects/codeobject.c#l291.
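
As a side note for readers on newer interpreters: Python 3.8 and later (well
after this thread) provide a `replace()` method on code objects, which avoids
depending on the positional argument order of the constructor. A minimal
sketch, assuming Python 3.8+:

```python
import types


def foo():
    return 'bar'


co = foo.__code__

# replace() with no arguments returns a copy of the code object;
# keyword arguments override individual fields.
co_copy = co.replace()
co_renamed = co.replace(co_name='renamed')

# Wrap the copied code object in a new function to show it still runs.
foo_copy = types.FunctionType(co_copy, globals())
print(foo_copy(), co_renamed.co_name)
```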

Thanks for clearing that up, though,

Mahmoud

--

___
Python tracker 
<http://bugs.python.org/issue13787>
___



[issue17911] traceback: add a new thin class storing a traceback without storing local variables

2015-02-10 Thread Mahmoud Hashemi

Mahmoud Hashemi added the comment:

Hey all, great to see this being worked on so diligently for so long. Having 
worked in this area for a while (at home and at PayPal), we've got a few 
learnings to share:

1) linecache is textbook not-threadsafe. For example, 
https://hg.python.org/cpython/file/default/Lib/linecache.py#l38

For a lightweight traceback wrapper to be concurrency-friendly, we've had to 
catch KeyErrors, like so: 
https://github.com/mahmoud/boltons/blob/master/boltons/tbutils.py#L115

It's kind of a blanket approach, but maybe we could make a separate issue and 
help out with a linecache refresh?

2) We use something like (filename, lineno) in our DeferredLine class, but for 
very lightweight areas (e.g., greenlet creation) we just save a reference to 
the code object, as the additional attribute accesses do end up showing up in 
the profiles.

3) Generally we've found the APIs in TracebackInfo here to be pretty 
sufficient/functional: 

https://github.com/mahmoud/boltons/blob/master/boltons/tbutils.py#L134

Let me know if you've got any questions on that, and keep up the good work!
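
A minimal sketch of the blanket KeyError guard mentioned in point 1.
`safe_getline` is a hypothetical helper name used only for illustration; it is
not the boltons implementation, and returning an empty string on failure is an
assumption about what a caller would want.

```python
import linecache


def safe_getline(filename, lineno):
    """Return the requested source line, or '' if the lookup fails.

    The KeyError guard covers the case where another thread clears or
    mutates the linecache cache between the lookup steps.
    """
    try:
        return linecache.getline(filename, lineno)
    except KeyError:
        return ''


# A missing file already yields '' from linecache itself.
missing = safe_getline('/no/such/file.py', 1)
print(repr(missing))
```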

--
nosy: +mahmoud

___
Python tracker 
<http://bugs.python.org/issue17911>
___



[issue23479] str.format() breaks object duck typing

2015-02-18 Thread Mahmoud Hashemi

New submission from Mahmoud Hashemi:

While porting some old code, I found some interesting misbehavior in the 
new-style string formatting. When formatting objects which support int and 
float conversion, old-style percent formatting works great, but new-style 
formatting explodes hard.

Here's a basic example:

class MyType(object):
    def __init__(self, func):
        self.func = func

    def __float__(self):
        return float(self.func())


print '%f' % MyType(lambda: 3)

# Output (python2 and python3): 3.000000


print '{:f}'.format(MyType(lambda: 3))

# Output (python2):
# Traceback (most recent call last):
#   File "tmp.py", line 28, in <module>
#     print '{:f}'.format(MyType(lambda: 3))
# ValueError: Unknown format code 'f' for object of type 'str'
#
# Output (python3.4):
# Traceback (most recent call last):
#   File "tmp.py", line 30, in <module>
#     print('{:f}'.format(MyType(lambda: 3)))
# TypeError: non-empty format string passed to object.__format__


And the same holds true for int and so forth. I would expect these behaviors to 
be the same between the two formatting styles, and tangentially, expect a more 
python2-like error message for the python 3 case.
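
One possible workaround, sketched here as an assumption rather than something
stated in the report: give the class a `__format__` method that delegates to
the float conversion. `FloatLike` is a hypothetical name.

```python
class FloatLike(object):
    def __init__(self, func):
        self.func = func

    def __float__(self):
        return float(self.func())

    def __format__(self, spec):
        # Delegate new-style formatting to the float value, so numeric
        # format codes such as 'f' are accepted.
        return format(float(self), spec)


new_style = '{:f}'.format(FloatLike(lambda: 3))
old_style = '%f' % FloatLike(lambda: 3)
print(new_style, old_style)
```

With this method in place, both formatting styles produce the same text for
the 'f' code.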

--
messages: 236192
nosy: mahmoud
priority: normal
severity: normal
status: open
title: str.format() breaks object duck typing
type: behavior
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6

___
Python tracker 
<http://bugs.python.org/issue23479>
___



[issue23479] str.format() breaks object duck typing

2015-02-18 Thread Mahmoud Hashemi

Changes by Mahmoud Hashemi :


--
nosy: +Mark.Williams

___
Python tracker 
<http://bugs.python.org/issue23479>
___



[issue23479] str.format() breaks object duck typing

2015-02-18 Thread Mahmoud Hashemi

Mahmoud Hashemi added the comment:

Well, thank you for the prompt and helpful replies everyone. Can't say I didn't 
wish the default behavior were more intuitive, but at least I think I have an 
idea how to work this. Thanks again!

--
resolution: not a bug -> 
status: closed -> open

___
Python tracker 
<http://bugs.python.org/issue23479>
___



[issue25341] File mode wb+ appears as rb+

2015-10-08 Thread Mahmoud Hashemi

Changes by Mahmoud Hashemi :


--
nosy: +mahmoud

___
Python tracker 
<http://bugs.python.org/issue25341>
___



[issue7359] mailbox cannot modify mailboxes in system mail spool

2016-02-07 Thread Mahmoud Hashemi

Mahmoud Hashemi added the comment:

Got bit by this, and since it's not a bug, here's "not" a fix: 
http://boltons.readthedocs.org/en/latest/mboxutils.html#boltons.mboxutils.mbox_readonlydir

Been in production for a while, working like a charm. Might there be interest 
in including this in the standard lib?

--
nosy: +mahmoud

___
Python tracker 
<http://bugs.python.org/issue7359>
___



[issue26623] JSON encode: more informative error

2016-03-23 Thread Mahmoud Lababidi

New submission from Mahmoud Lababidi:

The json.dumps()/encode functionality will raise an Error when an object that
cannot be json-encoded is encountered. The current Error message only shows the
Object itself. I would like to enhance the error message by also providing the
Type. This is useful when numpy.int objects are passed in and it is not clear
that they are numpy objects.
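
A minimal sketch of the idea using the existing `default` hook of
`json.dumps`, without touching the standard library. `Opaque` and `explain`
are hypothetical names; `Opaque` stands in for a value such as a numpy integer
whose repr looks like a plain int. The message wording is illustrative and is
not the wording of the attached patch.

```python
import json


class Opaque:
    """Stand-in for a non-serializable value whose repr looks like an int."""

    def __repr__(self):
        return '7'


def explain(obj):
    # json.dumps calls this hook for objects it cannot serialize.
    raise TypeError('%r of type %s.%s is not JSON serializable'
                    % (obj, type(obj).__module__, type(obj).__qualname__))


try:
    json.dumps({'n': Opaque()}, default=explain)
except TypeError as exc:
    message = str(exc)

print(message)
```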

--
components: Library (Lib)
messages: 262272
nosy: Mahmoud Lababidi
priority: normal
severity: normal
status: open
title: JSON encode: more informative error
type: enhancement
versions: Python 3.6

___
Python tracker 
<http://bugs.python.org/issue26623>
___



[issue26623] JSON encode: more informative error

2016-03-23 Thread Mahmoud Lababidi

Changes by Mahmoud Lababidi :


--
keywords: +patch
Added file: http://bugs.python.org/file42258/json_encode.patch

___
Python tracker 
<http://bugs.python.org/issue26623>
___



[issue26623] JSON encode: more informative error

2016-03-30 Thread Mahmoud Lababidi

Mahmoud Lababidi added the comment:

Is there a use case where the representation is too long? I think it may be 
useful to see the representation, but perhaps you are correct.

--

___
Python tracker 
<http://bugs.python.org/issue26623>
___



[issue26623] JSON encode: more informative error

2016-04-04 Thread Mahmoud Lababidi

Mahmoud Lababidi added the comment:

Serhiy,

I've attached a patch without the Object representation. Choose whichever you 
feel is better.

--
Added file: http://bugs.python.org/file42366/json_encode.patch

___
Python tracker 
<http://bugs.python.org/issue26623>
___



[issue24019] str/unicode encoding kwarg causes exceptions

2015-04-20 Thread Mahmoud Hashemi

New submission from Mahmoud Hashemi:

The encoding keyword argument to the Python 3 str() and Python 2 unicode() 
constructors is excessively constraining to the practical use of these core 
types.

Looking at common usage, both these constructors' primary mode is to convert 
various objects into text:

>>> str(2)
'2'

But adding an encoding yields:

>>> str(2, encoding='utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: coercing to str: need bytes, bytearray or buffer-like object, int found

While the error message is fine for an experienced developer, I would like to 
raise the question: is it necessary at all? Even harmlessly getting a str from 
a str is punished, but leaving off encoding is fine again:

>>> str('hi', encoding='utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decoding str is not supported
>>> str('hi')
'hi'

Merging and simplifying the two modes of these constructors would yield much 
more predictable results for experienced and beginning Pythonists alike. 
Basically, the encoding argument should be ignored if the argument is already a 
unicode/str instance, or if it is a non-string object. It should only be 
consulted if the primary argument is a bytestring. Bytestrings already have a
.decode() method on them; another, more obscure version of it isn't necessary.

Furthermore, despite the core nature and widespread usage of these types, 
changing this behavior should break very little existing code and 
understanding. unicode() and str() will simply behave as expected more often, 
returning text versions of the arguments passed to them. 

Appendix: To demonstrate the expected behavior of the proposed unicode/str, 
here is a code snippet we've employed to sanely and safely get a text version 
of an arbitrary object:

def to_unicode(obj, encoding='utf8', errors='strict'):
    # the encoding default should look at sys's value
    try:
        return unicode(obj)
    except UnicodeDecodeError:
        return unicode(obj, encoding=encoding, errors=errors)
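
For comparison, a rough Python 3 analogue of the helper above. `to_text` is a
hypothetical name, and the explicit type check is an assumption made because
Python 3's str() does not raise on bytes but returns their repr instead.

```python
def to_text(obj, encoding='utf-8', errors='strict'):
    """Return a text version of obj, decoding only when obj holds bytes."""
    if isinstance(obj, (bytes, bytearray)):
        # Only byte strings need the encoding; str(b'hi') alone gives "b'hi'".
        return str(obj, encoding=encoding, errors=errors)
    # Text and non-string objects ignore the encoding arguments.
    return str(obj)


print(to_text(2), to_text('hi'), to_text(b'hi'))
```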

After many years of writing Python and teaching it to developers of all 
experience levels, I firmly believe that this is the right interaction pattern 
for Python's core text type. I'm also happy to expand on this issue, turn it 
into a PEP, or submit a patch if there is interest.

--
components: Unicode
messages: 241699
nosy: ezio.melotti, haypo, mahmoud
priority: normal
severity: normal
status: open
title: str/unicode encoding kwarg causes exceptions
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6

___
Python tracker 
<http://bugs.python.org/issue24019>
___



[issue24019] str/unicode encoding kwarg causes exceptions

2015-04-21 Thread Mahmoud Hashemi

Mahmoud Hashemi added the comment:

Python already has one approach that fails to decode non-bytestrings: the 
.decode() method. 

This is about removing unicode barriers to entry and making the str constructor 
in Python 3 as succinctly useful as possible. There are several problems the 
helper does not solve:

1) Usage-wise, str/unicode is used to turn values into text. From a high-level 
perspective, the content does not change, only the representation format. 
Should this fundamental operation really require type inspection and explicit 
try/except blocks every single time? Or should it just work? sorted() does not 
raise an exception if the values are already sorted, why does str() raise an 
exception when the value is already a str?*

2) By and large, among developers, keyword arguments are viewed as "optional" 
arguments that have defaults which can be overridden. However, that is not the 
case here; str is not simply str(obj, encoding=sys.getdefaultencoding()). 
Explicitly passing the keyword argument breaks the call.

3) The helper does not help promote Python adoption when it must be copied and
pasted into new developers' projects. It does not help break down the
misconception that unicode is a punishing concept to be around in Python.

* This question is posed here rhetorically, but I have gotten variations on it 
from multiple Python developers in training.

--
versions: +Python 2.7

___
Python tracker 
<http://bugs.python.org/issue24019>
___



[issue24019] str/unicode encoding kwarg causes exceptions

2015-04-21 Thread Mahmoud Hashemi

Mahmoud Hashemi added the comment:

Martin, it sounds that way because that is what is being proposed: "Merging and 
simplifying the two modes". Given the existence of .decode() on bytestrings, 
the only objects that generally need decoding in Python 2 and 3, the existence 
of str/unicode's second mode constitutes a design bug.

Without a doubt, Python has frequently preferred convenient idioms over EAFP. 
Look at dict.get for an excellent example of defaults being used instead of 
forcing users to catch KeyErrors. That conversation could have gone a different 
way, but Python is better off having stuck to its pragmatic roots.

In answer to your questions, Martin, 1) I'd expect str(b"123", encoding=None) 
to do the same thing as str(b"123")  and 2) I'd expect str(obj) behavior to 
continue to depend on whether the object passed is string-like. Python is a 
duck-typed, dynamic language, and dynamic languages are most powerful when 
their core types reflect usability. Consistency is one of the foremost factors 
of usability, and having to frequently switch between two call patterns of the 
str constructor feels inconsistent and unusable.

--

___
Python tracker 
<http://bugs.python.org/issue24019>
___



[issue24019] str/unicode encoding kwarg causes exceptions

2015-04-22 Thread Mahmoud Hashemi

Mahmoud Hashemi added the comment:

I would urge you all to take a stronger look at usability, rather than parroting
the current state of the design and docs. Python gained renown over the years 
for its ability to stay flexible while maturing. Focusing on purity and 
ignoring the needs of practical programmers is exactly how PEP #461 ended up 
coming into play so late.

The inflexible arguments of str make a common task, turning data into text, an
order of magnitude harder than it needs to be.

--

___
Python tracker 
<http://bugs.python.org/issue24019>
___



[issue24172] Errors in resource.getpagesize module documentation

2015-05-12 Thread Mahmoud Hashemi

New submission from Mahmoud Hashemi:

The resource module's description of resource.getpagesize is woefully
misleading. Reproduced in full for convenience:

resource.getpagesize()

Returns the number of bytes in a system page. (This need not be the same as 
the hardware page size.) This function is useful for determining the number of 
bytes of memory a process is using. The third element of the tuple returned by 
getrusage() describes memory usage in pages; multiplying by page size produces 
number of bytes.

Besides being vague by not referring to the third element as ru_maxrss, the 
peak RSS for the process (i.e., not the current memory usage), tests on Linux, 
Darwin, and FreeBSD show the following:

  * Linux: ru_maxrss is in kilobytes
  * Darwin (OS X): ru_maxrss is in bytes
  * FreeBSD: ru_maxrss is in kilobytes (same as Linux)

Knowing the page size is probably useful to someone, but the misinformation has 
definitely sent more than one person down the wrong path here. Additionally, 
the correct information should be up in the getrusage() method documentation, 
closer to relevant field descriptions.
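
A minimal sketch of converting ru_maxrss to bytes, assuming the units reported
in the tests above (bytes on Darwin, kilobytes on Linux and FreeBSD).
`peak_rss_bytes` is a hypothetical helper name, and other platforms are simply
treated like Linux here, which is an unverified assumption.

```python
import resource
import sys


def peak_rss_bytes():
    """Return the peak resident set size of this process in bytes."""
    maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == 'darwin':
        # Darwin (OS X) reports ru_maxrss in bytes already.
        return maxrss
    # Linux and FreeBSD report ru_maxrss in kilobytes.
    return maxrss * 1024


peak = peak_rss_bytes()
print(peak)
```

Note that the page size from resource.getpagesize() plays no part in this
conversion.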

Mahmoud

--
assignee: docs@python
components: Documentation
messages: 243043
nosy: docs@python, mahmoud
priority: normal
severity: normal
status: open
title: Errors in resource.getpagesize module documentation
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6

___
Python tracker 
<http://bugs.python.org/issue24172>
___



[issue31561] difflib pathological behavior with mixed line endings

2017-09-23 Thread Mahmoud Al-Qudsi

Mahmoud Al-Qudsi added the comment:

Attaching file2

--
Added file: https://bugs.python.org/file47165/file2

___
Python tracker 
<https://bugs.python.org/issue31561>
___



[issue31561] difflib pathological behavior with mixed line endings

2017-09-23 Thread Mahmoud Al-Qudsi

New submission from Mahmoud Al-Qudsi:

While using the icdiff command line interface to difflib, I ran into an 
interesting issue where difflib took 47 seconds to compare two simple text 
documents (a PHP source code file that had been refactored via phptidy).

On subsequent analysis, it turned out to be some sort of pathological behavior 
triggered by the presence of mixed line endings. Normalizing the line endings 
in both files to \r\n via unix2dos and then comparing (making no other changes) 
resulted in the diff calculation completing in under 2 seconds.
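
The effect of line endings on the comparison can be sketched with difflib
directly. The sample text is made up, and this only shows that differing line
endings make every line compare as changed; it does not reproduce or measure
the slowdown seen with the attached files.

```python
import difflib

unix_text = ''.join('line %d\n' % i for i in range(5))
dos_text = unix_text.replace('\n', '\r\n')

# Keeping line endings: every line compares as different.
raw_diff = list(difflib.unified_diff(
    unix_text.splitlines(keepends=True),
    dos_text.splitlines(keepends=True)))

# Dropping line endings before comparing: the inputs are identical,
# so unified_diff yields nothing.
normalized_diff = list(difflib.unified_diff(
    unix_text.splitlines(), dos_text.splitlines(), lineterm=''))

print(len(raw_diff), len(normalized_diff))
```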

I have attached the documents in question (file1 and file2) to this bug report.

--
components: Library (Lib)
files: file1
messages: 302788
nosy: Mahmoud Al-Qudsi
priority: normal
severity: normal
status: open
title: difflib pathological behavior with mixed line endings
versions: Python 3.6
Added file: https://bugs.python.org/file47164/file1

___
Python tracker 
<https://bugs.python.org/issue31561>
___



[issue31561] difflib pathological behavior with mixed line endings

2017-09-24 Thread Mahmoud Al-Qudsi

Mahmoud Al-Qudsi added the comment:

@tim.peters

No, `icdiff` is not part of core and probably should be omitted from the 
remainder of this discussion.

I just checked, and it's actually not a mix of line endings in each file; it's
just that one file uses \n and the other uses \r\n.

You can reproduce this bug by taking _any_ file and copying it,
then executing `unix2dos file1; dos2unix file2`. You'll have two perfectly
"correct" files that difflib will struggle to handle.

(as a preface to what follows, I've written a binary diff and incremental 
backup utility, so I'm familiar with the intricacies and pitfalls when it comes 
to diffing. I have not looked at difflib's source code, however. Looking at the 
documentation for difflib, it's not clear whether or not it should be 
considered a naive binary diffing utility, since it does seem to have the 
concept of "lines".)

Given that _both_ input files are "correct" without line ending errors, I think 
the correct optimization here would be for difflib to "realize" that two chunks 
are "identical" but with different line endings (aka just plain different, not 
asking for this to be treated as a special case) but instead of going on to 
search for a match to either buffer, it should assume that no better match will 
be found later on and simply move on to the next block/chunk.

Of course, in the event where file2 has a line from file1 that is first present
with a different line ending and then repeated with the same line ending, difflib
will not choose the correct line, but that's probably not something worth
fretting over (like you said, mixed line endings == recipe for disaster).

Of course I can understand if all this is out of the scope of difflib and not 
an endeavor worth taking up.

--

___
Python tracker 
<https://bugs.python.org/issue31561>
___