New submission from hajoscher <hajosc...@gmail.com>:

Buffered reads of large files in a compressed tarfile stream perform poorly.

The buffered read in tarfile _Stream is extending a bytes object. It is much more efficient to use a list followed by a join. Using a list can mean seconds instead of minutes.

This performance regression was introduced in b506dc32c1a.

How to test:

    # create a random 50 MB file and pack it into a tarfile
    dd if=/dev/urandom of=test.bin count=50 bs=1M
    tar czvf test.tgz test.bin

    # read with tarfile as stream (note pipe symbol in 'r|gz')
    import tarfile
    tfile = tarfile.open("test.tgz", 'r|gz')
    for t in tfile:
        file = tfile.extractfile(t)
        if file:
            print(len(file.read()))

----------
components: Library (Lib)
messages: 320763
nosy: hajoscher
priority: normal
severity: normal
status: open
title: tarfile stream read performance
type: performance
versions: Python 3.4, Python 3.5, Python 3.6, Python 3.7, Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34010>
_______________________________________
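
The difference between extending a bytes object and collecting chunks in a list can be illustrated with a small standalone sketch. This is not the tarfile code itself: the function names read_concat and read_join, the bufsize default, and the use of io.BytesIO as a stand-in for the decompressed stream are all assumptions made for illustration.

```python
import io
import os
import timeit


def read_concat(stream, size, bufsize=16 * 1024):
    """Hypothetical helper: accumulate chunks by extending a bytes object.

    Each `data += chunk` may copy everything read so far, so the total
    work can grow quadratically with the amount of data.
    """
    data = b""
    while len(data) < size:
        chunk = stream.read(min(bufsize, size - len(data)))
        if not chunk:  # end of stream reached early
            break
        data += chunk
    return data


def read_join(stream, size, bufsize=16 * 1024):
    """Hypothetical helper: collect chunks in a list and join them once."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(min(bufsize, remaining))
        if not chunk:  # end of stream reached early
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    # Small in-memory payload standing in for a decompressed tar member.
    payload = os.urandom(4 * 1024 * 1024)
    for func in (read_concat, read_join):
        elapsed = timeit.timeit(
            lambda: func(io.BytesIO(payload), len(payload)), number=3
        )
        print(f"{func.__name__}: {elapsed:.3f}s")
```

Both helpers return the same bytes; only the accumulation strategy differs. The timings depend on payload size, buffer size, and interpreter version, so the gap seen with this small payload may be much smaller than the one reported for a 50 MB member.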