[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Michael Fox
Michael Fox added the comment: I thought about it some more and the only bug here is mine, failing to explicitly set mode='rt'. Maybe back when someone invented text and binary modes they should have been clear which was to be the default for all things. Maybe when someone made the base class, i
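For illustration, a minimal sketch of the difference the comment refers to: lzma.open() defaults to binary mode, so iterating yields bytes unless mode='rt' is passed explicitly. The in-memory payload and the variable names are assumptions made for this example and are not taken from the thread.

```python
import io
import lzma

# Build a small .xz payload in memory so the sketch needs no external file.
payload = lzma.compress("first line\nsecond line\n".encode("utf-8"))

# Default mode is binary ('rb'): iteration yields bytes objects.
with lzma.open(io.BytesIO(payload)) as f:
    binary_lines = list(f)

# Explicit text mode: iteration yields str, decoded with the given encoding.
with lzma.open(io.BytesIO(payload), mode="rt", encoding="utf-8") as f:
    text_lines = list(f)

print(binary_lines)
print(text_lines)
```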

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Wrapping a raw LZMAFile in a BufferedReader is a simple solution. But I am thinking about extending BufferedReader so that LZMAFile and BufferedReader could use a common buffer. Perhaps add a new method to BufferedIOBase which will be called when a buffer is under

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Michael Fox
Michael Fox added the comment: I thought of an even more hazardous case:

    if compression == 'gz':
        import gzip
        open = gzip.open
    elif compression == 'xz':
        import lzma
        open = lzma.open
    else:
        pass
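The hazard in this pattern is that the built-in open() defaults to text mode while gzip.open() and lzma.open() default to binary mode, so the same call can return str or bytes depending on which branch ran. One possible way around it, sketched here with a hypothetical helper named open_text (not something proposed in the thread), is to always pass the mode explicitly:

```python
import gzip
import lzma
import os
import tempfile


def open_text(path, compression=None, encoding="utf-8"):
    """Hypothetical helper: always open in text mode, whatever the opener."""
    if compression == "gz":
        opener = gzip.open
    elif compression == "xz":
        opener = lzma.open
    else:
        opener = open
    # Passing mode='rt' explicitly removes the dependence on each opener's default.
    return opener(path, mode="rt", encoding=encoding)


results = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample")
    for compression, writer in (("gz", gzip.open), ("xz", lzma.open), (None, open)):
        # Write the same text through each opener, then read it back.
        with writer(path, mode="wt", encoding="utf-8") as f:
            f.write("alpha\nbeta\n")
        with open_text(path, compression) as f:
            results[compression] = list(f)

print(results)
```

With the mode spelled out, all three branches yield str lines.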

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Michael Fox
Michael Fox added the comment: You're right. In fact, what doesn't make sense is to be doing line-oriented reads on a binary file. Why was I doing that? I do have another quibble though. The open() function is like this: open(file, mode='r', buffering=-1, encoding=None, errors=None, ne

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Nadeem Vawda
Nadeem Vawda added the comment: No, that is the intended behavior for binary streams - they operate at the level of individual bytes. If you want to treat your input file as Unicode-encoded text, you should open it in text mode. This will return a TextIOWrapper which handles the decoding and line

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-20 Thread Michael Fox
Michael Fox added the comment: I was thinking about this line:

    end = self._buffer.find(b"\n", self._buffer_offset) + 1

Might be a bug? For example, is there a Unicode character where one of its several bytes is '\n'? In this case it splits the line in the middle of a character, right?
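Whether a byte-level search for b"\n" can land inside a character depends on the encoding. In UTF-8 every byte of a multi-byte sequence has its high bit set, so the byte 0x0A only ever represents a real newline. In UTF-16 the byte 0x0A can appear as one half of an unrelated code unit. A small sketch (the sample string is an assumption chosen for illustration):

```python
# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) is 0x0A 0x01 in UTF-16-LE.
text = "caf\u00e9 \u4e2d\u6587 \u010a"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# The text contains no newline character, so any 0x0A byte found here
# would be part of some other character's encoding.
utf8_has_newline_byte = b"\n" in utf8
utf16_has_newline_byte = b"\n" in utf16

print(utf8_has_newline_byte, utf16_has_newline_byte)
```

Splitting a binary stream on b"\n" is therefore safe for UTF-8 data but not for encodings such as UTF-16, which is one reason to open the file in text mode when the content is text.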

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Arfrever Frehtes Taifersar Arahesis
Changes by Arfrever Frehtes Taifersar Arahesis: nosy: +Arfrever

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Nadeem Vawda
Nadeem Vawda added the comment:
> I agree that making lzma.open() wrap its return value in a BufferedReader
> (or BufferedWriter, as appropriate) is the way to go.
On second thoughts, there's no need to change the behavior for mode='wb'. We can just return a BufferedReader for mode='rb', and lea

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Nadeem Vawda
Nadeem Vawda added the comment: I agree that making lzma.open() wrap its return value in a BufferedReader (or BufferedWriter, as appropriate) is the way to go. I'm currently travelling and don't have my SSH key with me - Serhiy, can you make the change? I'll put together a documentation patch th

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Michael Fox
Michael Fox added the comment: io.BufferedReader works well for me. Thanks for the good suggestion. Now Python 3.3 and 3.4 have similar performance to each other and they are only 2x slower than pyliblzma. From my perspective default wrapping with io.BufferedReader is a great idea. I can't thin

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Antoine Pitrou
Antoine Pitrou added the comment: I second Serhiy here. Wrapping the LZMAFile in a BufferedReader is the simple solution to the performance problem:

    ./python -m timeit -s "import lzma, io" "f=lzma.LZMAFile('words.xz', 'r')" "for line in f: pass"
    10 loops, best of 3: 148 msec per loop
    $ ./pyt

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-19 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: I'm against implementing LZMAFile in pure C. It was a great win that LZMAFile was implemented in pure Python. However, maybe we could reuse the existing accelerated implementation of io.BufferedReader.

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Raymond Hettinger
Raymond Hettinger added the comment: Serhiy, would you like to take this one?
assignee: -> serhiy.storchaka
stage: -> needs patch
versions: +Python 3.4 -Python 3.3

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Raymond Hettinger
Raymond Hettinger added the comment:
> So, unless someone thinks that a pure C extension is the
> right technical direction, lzma in 3.4 is probably as fast
> as it's ever going to be.
I would support the inclusion of a C extension. Reasonable performance is a prerequisite for broader adoption

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: Try `f = io.BufferedReader(f)`.
nosy: +serhiy.storchaka
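A minimal sketch of that suggestion, assuming an in-memory payload in place of the reporter's bigfile.xz (the payload and the line count of 1000 are made up for illustration). The BufferedReader serves line reads from its own buffer instead of going through LZMAFile's Python-level readline for every line. Later Python releases may already buffer LZMAFile internally, in which case the extra wrapper would make little difference.

```python
import io
import lzma

# Compress 1000 short lines in memory; stands in for a real .xz file.
payload = lzma.compress(b"".join(b"line %d\n" % i for i in range(1000)))

raw = lzma.LZMAFile(io.BytesIO(payload), "r")

# Wrap the LZMAFile so that line iteration is handled by io.BufferedReader.
with io.BufferedReader(raw) as buffered:
    count = sum(1 for _ in buffered)

print(count)
```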

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Michael Fox
Michael Fox added the comment: I looked into it a little, and it looks like pyliblzma is a pure C extension, whereas the new lzma library wraps liblzma but the rest is Python. In particular, this happens for every line:

    if size < 0:
        end = self._buffer.find(b"\n", self._buffer_offset

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Michael Fox
Michael Fox added the comment: 3.4 is much better but still 4x slower than 2.7

    m@air:~/q/topaz/parse_datalog$ time python2.7 lzmaperf.py
    102368
    real 0m0.053s
    user 0m0.052s
    sys  0m0.000s
    m@air:~/q/topaz/parse_datalog$ time ~/tmp/cpython-23836f17e4a2/bin/python3.4 lzmaperf.py
    102368
    rea

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-18 Thread Nadeem Vawda
Nadeem Vawda added the comment: Have you tried running the benchmark against the default (3.4) branch? There was some significant optimization work done in issue 16034, but the changes were not backported to 3.3.

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-17 Thread STINNER Victor
Changes by STINNER Victor: nosy: +haypo

[issue18003] New lzma crazy slow with line-oriented reading.

2013-05-17 Thread Michael Fox
New submission from Michael Fox:

    import lzma
    count = 0
    f = lzma.LZMAFile('bigfile.xz', 'r')
    for line in f:
        count += 1
    print(count)

Comparing python2 with pyliblzma to python3.3.1 with the new lzma:

    m@air:~/q/topaz/parse_datalog$ time python lzmaperf.py
    102368
    real 0m0.062s
    user 0m0.