[issue18744] pathological performance using tarfile

K Richard Pixley Wed, 14 Aug 2013 22:21:36 -0700

New submission from K Richard Pixley:

There's a problem with tarfile.  Write a program to traverse the contents of a
modestly sized tar archive.  Make sure your tar archive is compressed.  Then
read the tar archive with your program.


I'm finding that letting tarfile read a compressed archive directly costs me
a performance penalty on the order of 60x compared with opening the file with
gzip and then passing the gzip file object to tarfile.  Programs that could
take a few minutes are literally taking a few hours when using tarfile.
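
A minimal sketch of the two approaches being compared, for anyone who wants
to reproduce this.  It is written for Python 3 (the report is against 2.7),
the helper names and the tiny in-memory archive are hypothetical, and it only
shows that both paths yield the same members; it is not a benchmark.

```python
import gzip
import io
import tarfile


def build_sample_archive(member_count=5):
    """Build a small gzip-compressed tar archive in memory (made-up test data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for i in range(member_count):
            data = ("member %d\n" % i).encode("ascii")
            info = tarfile.TarInfo(name="file%02d.txt" % i)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_via_tarfile(raw):
    """Approach 1: let tarfile detect and handle the compression itself."""
    contents = {}
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:*") as tf:
        for member in tf:
            contents[member.name] = tf.extractfile(member).read()
    return contents


def read_via_gzip(raw):
    """Approach 2: open with gzip first, then hand the gzip object to tarfile."""
    contents = {}
    with gzip.GzipFile(fileobj=io.BytesIO(raw)) as gz:
        # "r:" tells tarfile the stream it receives is an uncompressed tar.
        with tarfile.open(fileobj=gz, mode="r:") as tf:
            for member in tf:
                contents[member.name] = tf.extractfile(member).read()
    return contents
```

Wrapping each function in a timer and pointing it at a real archive on disk
should show whether the gap described above appears on a given setup.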

This seems stupid.  The tarfile library could do the same thing I'm doing
manually.  In fact, I had assumed that it would, and I was surprised by the
performance I was seeing, so I ran the program under the profiler and saw
millions of decompression calls.  It's almost as though the tarfile library
is decompressing the entire archive for every member extraction.

Note that performance gets even worse if you sort the member names and then
extract in that order.  I'm not sure whether this "should" matter, since the
tar file order is sequential.
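
A small sketch of the two traversal orders, again for Python 3 and with
hypothetical helper names and made-up data.  One possible explanation, not
confirmed in this report, is that name-sorted extraction forces backward
seeks, and gzip.GzipFile handles a backward seek by rewinding and
decompressing again from the start of the stream.

```python
import io
import tarfile


def make_archive(names):
    """Build an in-memory .tar.gz whose members are stored in the given order."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name in names:
            data = name.encode("ascii")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_in_archive_order(raw):
    """Visit members in stored order, so reads only ever move forward."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tf:
        return [(m.name, tf.extractfile(m).read()) for m in tf]


def read_in_sorted_order(raw):
    """Visit members sorted by name, which may jump backward in the stream."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tf:
        members = sorted(tf.getmembers(), key=lambda m: m.name)
        return [(m.name, tf.extractfile(m).read()) for m in members]
```

Both functions return the same data; only the access pattern against the
compressed stream differs.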

----------
components: Library (Lib)
messages: 195232
nosy: teamnoir
priority: normal
severity: normal
status: open
title: pathological performance using tarfile
type: performance
versions: Python 2.7

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue18744>
_______________________________________