[issue18003] New lzma crazy slow with line-oriented reading.

Michael Fox Mon, 20 May 2013 10:34:51 -0700

Michael Fox added the comment:

I thought about it some more and the only bug here is mine, failing to
explicitly set mode='rt'.
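
A minimal sketch of that difference, assuming a small in-memory sample (the
payload and variable names are illustrative, not from the original report).
With the default mode, lzma.open() yields bytes; with mode='rt' it yields
decoded text lines:

```python
import io
import lzma

# Hypothetical sample data, compressed in memory so the sketch is self-contained.
payload = lzma.compress("alpha\nbeta\ngamma\n".encode("utf-8"))

# Default mode is 'rb': iterating the file yields bytes objects.
with lzma.open(io.BytesIO(payload)) as f:
    binary_lines = list(f)

# Explicit mode='rt': the stream is wrapped in an io.TextIOWrapper,
# so iteration yields decoded str lines.
with lzma.open(io.BytesIO(payload), mode="rt", encoding="utf-8") as f:
    text_lines = list(f)
```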


Maybe back when someone invented text and binary modes they should
have been clear which was to be the default for all things. Maybe when
someone made the base class, io.IOBase they should have defined an
.open() with a mode that matched open(). Maybe when someone first
implemented gzip.open() they should have matched open().

But that's not what happened and there's going to be lots of code out
there relying on the default 'rt' for open() and 'rb' for
gzip/bz2/lzma.open(). There's going to be lots of bugs in the future
as people familiar with one thing assume the default is the same for
the other. But oh well. It's too late to change.
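
One way to sidestep the mismatched defaults is to never rely on them. The
sketch below is only an illustration: open_text() and the _OPENERS table are
hypothetical names, not part of the standard library, and the choice of
suffixes and the UTF-8 default are assumptions.

```python
import bz2
import gzip
import lzma

# Hypothetical mapping from file suffix to opener; extend as needed.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_text(filename, encoding="utf-8"):
    """Open a plain or compressed file for reading text (hypothetical helper)."""
    for suffix, opener in _OPENERS.items():
        if filename.endswith(suffix):
            # gzip/bz2/lzma default to 'rb', so request text mode explicitly.
            return opener(filename, mode="rt", encoding=encoding)
    # The builtin already defaults to text mode; spell it out for symmetry.
    return open(filename, mode="rt", encoding=encoding)
```

Callers can then write "for line in open_text(name): ..." and get str lines
whether or not the file is compressed.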

On Mon, May 20, 2013 at 9:50 AM, Michael Fox <[email protected]> wrote:
>
> Michael Fox added the comment:
>
> I thought of an even more hazardous case:
>
> if compression == 'gz':
>     import gzip
>     open = gzip.open
> elif compression == 'xz':
>     import lzma
>     open = lzma.open
> else:
>     pass
>
> On Mon, May 20, 2013 at 9:41 AM, Michael Fox <[email protected]> wrote:
>>
>> Michael Fox added the comment:
>>
>> You're right. In fact, what doesn't make sense is to be doing
>> line-oriented reads on a binary file. Why was I doing that?
>>
>> I do have another quibble though. The open() function is like this:
>>
>> open(file, mode='r', buffering=-1, encoding=None,
>>          errors=None, newline=None, closefd=True, opener=None) -> file object
>>
>> The lzma.open() function is like this:
>>
>> lzma.open(filename, mode='rb', *, format=None, check=-1,
>> preset=None, filters=None, encoding=None, errors=None, newline=None)
>>
>> It seems to me that it would be best for them to be as congruent as
>> possible. Because people will try to do this (I did):
>>
>> if filename.endswith('.xz'):
>>     f = lzma.open(filename)
>> else:
>>     f = open(filename)
>> for line in f: ...
>>
>> And then they will be in for a surprise. Would you consider changing
>> the default mode of lzma.open() to 'rt' and implement the 'buffering'
>> parameter as it is implemented in open()? And further, can we discuss
>> whether "duck typing" is becoming generally problematic in an
>> expanding standard library and whether there should be some process,
>> language, testing or something to ensure the ducks really quack the
>> same?
>>
>> For example, there could be a standard testsuite which everything
>> purporting to implement an open() function should be subject to.
>>
>> On Mon, May 20, 2013 at 7:42 AM, Nadeem Vawda <[email protected]> wrote:
>>>
>>> Nadeem Vawda added the comment:
>>>
>>> No, that is the intended behavior for binary streams - they operate at
>>> the level of individual bytes. If you want to treat your input file as
>>> Unicode-encoded text, you should open it in text mode. This will return a
>>> TextIOWrapper which handles the decoding and line splitting properly.
>>>
>>> ----------
>>>
>>> _______________________________________
>>> Python tracker <[email protected]>
>>> <http://bugs.python.org/issue18003>
>>> _______________________________________
>>
>> --
>>
>> -
>> Michael
>>
>
> --
>
> -
> Michael
>

-- 

-
Michael
