[issue34010] tarfile stream read performance

hajoscher Sat, 30 Jun 2018 02:27:37 -0700


New submission from hajoscher <hajosc...@gmail.com>:


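Buffered reads of large files in a compressed tarfile stream perform poorly.

The buffered read in tarfile's _Stream extends a bytes object on every
iteration, so the data accumulated so far is copied each time a chunk is added.
It is much more efficient to append the chunks to a list and join them once at the end.
For large files, using a list can mean seconds instead of minutes.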

This performance regression was introduced in b506dc32c1a. 
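
To illustrate the difference between the two accumulation patterns, here is a minimal sketch. It is not the actual tarfile code: the function names read_concat and read_join and the 16 KiB default buffer size are assumptions chosen for the example.

```python
import io


def read_concat(stream, size, bufsize=16 * 1024):
    """Accumulate data by extending a bytes object (the slow pattern)."""
    buf = b""
    while len(buf) < size:
        chunk = stream.read(min(bufsize, size - len(buf)))
        if not chunk:
            break  # stream exhausted before `size` bytes were read
        buf += chunk  # builds a new bytes object, copying everything so far
    return buf


def read_join(stream, size, bufsize=16 * 1024):
    """Accumulate chunks in a list and join them once at the end."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(min(bufsize, remaining))
        if not chunk:
            break  # stream exhausted before `size` bytes were read
        chunks.append(chunk)  # cheap: only stores a reference
        remaining -= len(chunk)
    return b"".join(chunks)  # single copy of the whole payload
```

Both functions return the same bytes. The first copies the growing buffer on each iteration, so its cost grows roughly quadratically with the amount of data read, while the second copies the data once.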

How to test:
# create a 50 MB file of random data and pack it into a gzip-compressed tarball
dd if=/dev/urandom of=test.bin count=50 bs=1M
tar czvf test.tgz test.bin

# read with tarfile as stream (note pipe symbol in 'r|gz')
import tarfile
tfile = tarfile.open("test.tgz", 'r|gz')
for t in tfile:
    file = tfile.extractfile(t)
    if file:
        print(len(file.read()))
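
For platforms without dd and tar, the same test can be sketched in pure Python. This is only a sketch: the helper names make_tgz and read_stream are hypothetical, and the archive is built in memory rather than on disk.

```python
import io
import os
import tarfile


def make_tgz(payload_size):
    """Build an in-memory gzip-compressed tar archive holding one random file."""
    payload = os.urandom(payload_size)
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="test.bin")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return raw.getvalue()


def read_stream(archive_bytes):
    """Read every regular file through stream mode (note the pipe in 'r|gz')."""
    sizes = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r|gz") as tfile:
        for member in tfile:
            extracted = tfile.extractfile(member)
            if extracted:
                sizes.append(len(extracted.read()))
    return sizes
```

Calling read_stream(make_tgz(50 * 1024 * 1024)) reproduces the 50 MB case above; wrapping the call in time.perf_counter() measurements shows how long the stream read takes on a given Python version.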

----------
components: Library (Lib)
messages: 320763
nosy: hajoscher
priority: normal
severity: normal
status: open
title: tarfile stream read performance
type: performance
versions: Python 3.4, Python 3.5, Python 3.6, Python 3.7, Python 3.8

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue34010>
_______________________________________