Bugs item #1527974, was opened at 2006-07-24 23:00
Message generated for change (Comment added) made by arve_knudsen
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1527974&group_id=5470

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Arve Knudsen (arve_knudsen)
Assigned to: Nobody/Anonymous (nobody)
Summary: tarfile chokes on ipython archive on Windows

Initial Comment:
I'm trying to extract files from the latest ipython tar archive, available from http://ipython.scipy.org/dist/ipython-0.7.2.tar.gz, using tarfile. This is on Windows XP, using Python 2.4.3. There is only a problem if I open the archive in stream mode (the "mode" argument to tarfile.open is "r|gz"), in which case tarfile raises StreamError. I'd be happy if this error could be sorted out.

The following script should trigger the error:

import tarfile

f = file(r"ipython-0.7.2.tar.gz", "rb")
tar = tarfile.open(fileobj=f, mode="r|gz")  # "r|gz": gzip stream mode, no backward seeks
try:
    for m in tar:
        tar.extract(m)
finally:
    tar.close()
    f.close()

The resulting exception:

Traceback (most recent call last):
  File "tst.py", line 7, in ?
    tar.extract(m)
  File "C:\Program Files\Python24\lib\tarfile.py", line 1335, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "C:\Program Files\Python24\lib\tarfile.py", line 1431, in _extract_member
    self.makelink(tarinfo, targetpath)
  File "C:\Program Files\Python24\lib\tarfile.py", line 1515, in makelink
    self._extract_member(self.getmember(linkpath), targetpath)
  File "C:\Program Files\Python24\lib\tarfile.py", line 1423, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "C:\Program Files\Python24\lib\tarfile.py", line 1461, in makefile
    copyfileobj(source, target)
  File "C:\Program Files\Python24\lib\tarfile.py", line 158, in copyfileobj
    shutil.copyfileobj(src, dst)
  File "C:\Program Files\Python24\lib\shutil.py", line 22, in copyfileobj
    buf = fsrc.read(length)
  File "C:\Program Files\Python24\lib\tarfile.py", line 551, in _readnormal
    self.fileobj.seek(self.offset + self.pos)
  File "C:\Program Files\Python24\lib\tarfile.py", line 420, in seek
    raise StreamError, "seeking backwards is not allowed"
tarfile.StreamError: seeking backwards is not allowed

----------------------------------------------------------------------

>Comment By: Arve Knudsen (arve_knudsen)
Date: 2006-07-27 00:20

Message:
Logged In: YES
user_id=1522083

Regarding my last comment, sorry about the noise. After giving it some more thought I realized it was not very realistic implementation-wise, seeing as you can't know whether a file is being linked to when you encounter it in the stream (right?). So I followed your suggestion instead and handled the links on the client level.

What I think I'd like to see in TarFile though is an 'extractall' method with the ability to report progress to an optional callback, since I'm only opening in stream mode as a hack to implement this myself (by monitoring file position). From browsing tarfile's source it seems it might require some effort though (with e.g.
BZ2File you can't know the amount of data without decompressing everything?).

----------------------------------------------------------------------

Comment By: Arve Knudsen (arve_knudsen)
Date: 2006-07-25 11:58

Message:
Logged In: YES
user_id=1522083

Yes, I admit that is a weakness of my proposed approach. Perhaps it would be a better idea to extract hardlinked files to a temporary location and copy those files when needed, as a cache? The only problem that I can think of with this approach is the overhead, but perhaps this could be configurable through a keyword if you think it would pose a significant problem (i.e. keeping extra copies of potentially huge files)? The temporary cache would be private to tarfile, so there should be no need to worry about modifications to the contained files.

----------------------------------------------------------------------

Comment By: Lars Gustäbel (gustaebel)
Date: 2006-07-25 11:31

Message:
Logged In: YES
user_id=642936

Copying the previously extracted file is not an option. When the archive is extracted inside a loop, you never know what happens between two extract() calls. The original file could have been renamed, changed or removed. Suppose you want to extract just those members which are hard links:

for tarinfo in tar:
    if tarinfo.islnk():
        tar.extract(tarinfo)

I agree with you that the error message is bad because it does not give the slightest idea of what's going wrong. I'll see what I can do about that.

To work around your particular problem, my idea is to subclass the TarFile class and replace the makelink() method with one that simply copies the file as you proposed.

----------------------------------------------------------------------

Comment By: Arve Knudsen (arve_knudsen)
Date: 2006-07-25 10:59

Message:
Logged In: YES
user_id=1522083

Thanks for the clarification, Lars. I'd prefer to continue with my current approach however, since it allows me to report progress as the tarfile is unpacked/decompressed.
Also, I don't think it would be satisfactory at all if tarfile would just die with a mysterious error in such cases. In order to resolve this, why must tarfile extract the file again? Can't it copy the already extracted file?

----------------------------------------------------------------------

Comment By: Lars Gustäbel (gustaebel)
Date: 2006-07-25 10:42

Message:
Logged In: YES
user_id=642936

The traceback tells me that there is a hard link inside the archive, which means that a file in the archive is referenced twice. This hard link can be extracted only on platforms that have an os.link() function. On Win32 hard links are not supported by the file system, but tarfile works around this by extracting the referenced file twice. In order to extract the file the second time it is necessary that tarfile seeks back in the input file to access the file's data again. But "seeking backwards is not allowed" when a file is opened in streaming mode ;-)

If you do not necessarily need streaming mode for your application, better use "r:gz" or "r" and the problem will be gone.

----------------------------------------------------------------------

Comment By: Arve Knudsen (arve_knudsen)
Date: 2006-07-25 10:04

Message:
Logged In: YES
user_id=1522083

Ok, I've verified now that the problem persists with Python 2.4.4 (from the 2.4 branch in svn). The exact same thing happens.

----------------------------------------------------------------------

Comment By: Arve Knudsen (arve_knudsen)
Date: 2006-07-25 09:29

Message:
Logged In: YES
user_id=1522083

Well yeah, it appears to be Windows specific. I just tested on Linux (Ubuntu), also with Python 2.4.3. I'll try 2.4.3+ on Windows to see if it makes any difference. Come to think of it, I think I experienced this problem in the past on Linux, but then I solved it by repacking ipython. Also, if I pack it myself on Windows using bsdtar it works fine.

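The client-level handling of hard links mentioned in this thread could be sketched roughly as follows. This is only an illustration, written for Python 3 rather than the Python 2.4 of the report; the function name extract_stream and its dest parameter are hypothetical and not part of tarfile. It reads the archive in stream mode and, for each hard-link member, copies the file that was already extracted instead of asking tarfile to seek backwards. It assumes the link target appears earlier in the stream and has not been renamed, changed or removed in the meantime, which is exactly the caveat Lars raises.

```python
import os
import shutil
import tarfile


def extract_stream(fileobj, dest="."):
    """Hypothetical helper: extract a gzip tar stream without backward seeks.

    Hard-link members are satisfied by copying the already extracted
    target file, so the stream never has to be rewound.
    """
    extracted = []
    tar = tarfile.open(fileobj=fileobj, mode="r|gz")
    try:
        for member in tar:
            target = os.path.join(dest, member.name)
            if member.islnk():
                # Assumption: the link target was extracted earlier in this
                # same loop and is still intact; otherwise copyfile raises.
                source = os.path.join(dest, member.linkname)
                os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
                shutil.copyfile(source, target)
            else:
                tar.extract(member, dest)
            extracted.append(member.name)
    finally:
        tar.close()
    return extracted
```

Because the loop owns every step between two extract() calls, the objection about unknown modifications applies less strongly here than it would inside tarfile itself, but the copy is still only as trustworthy as the file on disk.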
----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2006-07-25 05:35

Message:
Logged In: YES
user_id=33168

I tested this on Linux with both 2.5 and 2.4.3+ without problems. I believe there were some fixes in this area. Could you try testing with the 2.4.3+ current which will become 2.4.4 (or 2.5b2)? If this is still a problem, it looks like it may be Windows specific.

----------------------------------------------------------------------
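The progress-reporting hack described in this thread (monitoring the file position while reading in stream mode) might look something like this in Python 3. ProgressReader, extract_with_progress and the callback signature are hypothetical names chosen for illustration; tarfile itself offers no such API. The sketch relies on the observation, from reading the CPython source, that in "r|gz" mode tarfile only calls read() on the file object it is given, so counting compressed bytes as they pass through gives a progress figure without decompressing anything in advance.

```python
import os
import tarfile


class ProgressReader:
    """Hypothetical wrapper that counts compressed bytes handed to tarfile."""

    def __init__(self, fileobj, total, callback):
        self.fileobj = fileobj
        self.total = total
        self.callback = callback
        self.pos = 0

    def read(self, size=-1):
        data = self.fileobj.read(size)
        self.pos += len(data)
        # Report (bytes consumed so far, compressed size of the archive).
        self.callback(self.pos, self.total)
        return data


def extract_with_progress(path, dest, callback):
    """Extract a .tar.gz in stream mode, calling callback(pos, total)."""
    total = os.path.getsize(path)
    with open(path, "rb") as f:
        reader = ProgressReader(f, total, callback)
        tar = tarfile.open(fileobj=reader, mode="r|gz")
        try:
            for member in tar:
                tar.extract(member, dest)
        finally:
            tar.close()
```

The figure is measured in compressed bytes and advances in buffer-sized steps, so it is coarse for small archives and may stop short of the total if tarfile reaches the end-of-archive marker before the end of the file. Hard-link members would still need the client-level handling discussed in the thread on platforms where tarfile falls back to re-reading the data.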