[issue10436] tarfile.extractfile in "r|" stream mode fails with filenames or members from getmembers()

2010-11-16 Thread David Nesting

New submission from David Nesting:

When opening a tarfile with mode "r|" (streaming mode), extractfile("filename") 
and extractfile(mytarfile.getmembers()[0]) raise "tarfile.StreamError: seeking 
backwards is not allowed".  extractfile(mytarfile.next()) succeeds.  A more 
complete test case:

"""
import tarfile
import StringIO

# Create a simple tar file in memory.  This could easily be a real tar file
# though.
data = StringIO.StringIO()
tf = tarfile.open(fileobj=data, mode="w")
tarinfo = tarfile.TarInfo(name="testfile")
filedata = StringIO.StringIO("test data")
tarinfo.size = len(filedata.getvalue())
tf.addfile(tarinfo, fileobj=filedata)
tf.close()
data.seek(0)

# Open as an uncompressed stream
tf = tarfile.open(fileobj=data, mode="r|")

#f = tf.extractfile("testfile")
#print "%s: %s" % (f.name, f.read())
#
#Traceback (most recent call last):
#  File "./bug.py", line 19, in <module>
#    print "%s: %s" % (f.name, f.read())
#  File "/usr/lib/python2.7/tarfile.py", line 815, in read
#    buf += self.fileobj.read()
#  File "/usr/lib/python2.7/tarfile.py", line 735, in read
#    return self.readnormal(size)
#  File "/usr/lib/python2.7/tarfile.py", line 742, in readnormal
#    self.fileobj.seek(self.offset + self.position)
#  File "/usr/lib/python2.7/tarfile.py", line 554, in seek
#    raise StreamError("seeking backwards is not allowed")
#tarfile.StreamError: seeking backwards is not allowed

#for member in tf.getmembers():
#  f = tf.extractfile(member)
#  print "%s: %s" % (f.name, f.read())
#
# Same traceback

while True:
  member = tf.next()
  if member is None:
    break
  f = tf.extractfile(member)
  print "%s: %s" % (f.name, f.read())

# This works.
"""

It appears that extractfile("filename") invokes getmember("filename"), which 
invokes getmembers().  getmembers() scans the entire archive before returning 
results, and in doing so it reads past and discards the actual file data, 
which makes it impossible to extract that data afterwards.

If this is accurate, this seems tricky to completely fix.  You could make 
getmembers() a generator that doesn't read too far ahead so that the file's 
contents are still available if someone wants to retrieve them for each file 
yielded.  getmember("filename") could just scan forward through the file until 
it hits a match, but you'd still lose the ability to do a getmember("filename") 
on a file that we skipped over.
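
To illustrate the "scan forward until it hits a match" idea, here is a rough 
sketch.  It is written for Python 3 (io.BytesIO instead of StringIO), and the 
helper names build_archive() and read_member_streaming() are made up for this 
example; neither is part of tarfile.  It relies only on iterating the TarFile 
and calling extractfile() on each member before moving on.

```python
import io
import tarfile


def build_archive(files):
    """Return an in-memory tar archive holding a name -> bytes mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tf.addfile(info, fileobj=io.BytesIO(payload))
    buf.seek(0)
    return buf


def read_member_streaming(tf, name):
    """Hypothetical helper: scan forward and return the first match's bytes.

    Returns None if no member with that name is found (or if the match is
    not a regular file).  Members passed over on the way are not recoverable.
    """
    for member in tf:
        if member.name == name:
            f = tf.extractfile(member)
            return f.read() if f is not None else None
    return None


archive = build_archive({"first": b"skipped", "testfile": b"test data"})
with tarfile.open(fileobj=archive, mode="r|") as tf:
    found = read_member_streaming(tf, "testfile")

print(found)
```

The data is read as soon as the matching header is seen, so the stream never 
has to seek backwards.  The limitation described above still applies: once 
"first" has been skipped, its data cannot be retrieved from the same stream.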

If nothing else, document that extractfile("filename"), getmember() and 
getmembers() won't work reliably in streaming mode, and possibly raise an 
exception whenever someone calls them in that mode, just to make the behavior 
consistent.

--
components: Library (Lib)
messages: 121308
nosy: David.Nesting
priority: normal
severity: normal
status: open
title: tarfile.extractfile in "r|" stream mode fails with filenames or members 
from getmembers()
type: behavior
versions: Python 2.6, Python 2.7

___
Python tracker 
<http://bugs.python.org/issue10436>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com


2010-11-17 Thread David Nesting

David Nesting added the comment:

Thanks, Lars.  And this does make complete sense to me in retrospect.

Better documentation here would help a lot.  I'm happy to take a stab at this.  
Short of labeling methods as "safe for streaming" versus "unsafe for 
streaming", it occurs to me that it would be a lot cleaner if TarFile were 
actually broken up into two classes: one streaming-safe, and the other layering 
random access convenience methods on top of that.  For compatibility's sake the 
open method should probably still return an instance of the composite class, 
but at least it keeps these logically separate internally and makes it easier 
to document.
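
A rough sketch of what such a split might look like, purely for discussion.  
The class names StreamingTarReader and RandomAccessTarReader are hypothetical, 
and this wraps the existing tarfile.open() rather than restructuring TarFile 
itself; it is written for Python 3.

```python
import io
import tarfile


class StreamingTarReader:
    """Hypothetical streaming-safe layer: forward-only iteration."""

    _mode = "r|"

    def __init__(self, fileobj):
        self._tf = tarfile.open(fileobj=fileobj, mode=self._mode)

    def __iter__(self):
        # Hand out each member together with its file object while the
        # member's data is still ahead of the read position.
        for member in self._tf:
            yield member, self._tf.extractfile(member)

    def close(self):
        self._tf.close()


class RandomAccessTarReader(StreamingTarReader):
    """Hypothetical layer adding lookups that need a seekable file."""

    _mode = "r:"

    def names(self):
        return self._tf.getnames()

    def read(self, name):
        f = self._tf.extractfile(name)
        return f.read() if f is not None else None


def make_archive():
    """Build a small two-member archive in memory for the demo."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, payload in (("a.txt", b"alpha"), ("b.txt", b"beta")):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tf.addfile(info, fileobj=io.BytesIO(payload))
    buf.seek(0)
    return buf


# Streaming use: each member must be consumed before advancing.
streamed = {}
reader = StreamingTarReader(make_archive())
for member, f in reader:
    streamed[member.name] = f.read()
reader.close()

# Random-access use: lookup by name, in any order.
ra = RandomAccessTarReader(make_archive())
picked = ra.read("b.txt")
ra.close()
```

The streaming class simply has no getmember()-style methods, so the 
documentation question of which calls are safe in "r|" mode mostly answers 
itself.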
