Nick Craig-Wood <[EMAIL PROTECTED]> writes:

> Kevin Ar18 <[EMAIL PROTECTED]> wrote:
>>
>> I posted this on the forum, but nobody seems to know the solution:
>> http://python-forum.org/py/viewtopic.php?t=5230
>>
>> I have a zip file that is several GB in size, and one of the files inside
>> of it is several GB in size.  When it comes time to read the 5+GB file
>> from inside the zip file, it fails with the following error:
>>
>>   File "...\zipfile.py", line 491, in read
>>     bytes = self.fp.read(zinfo.compress_size)
>>   OverflowError: long int too large to convert to int
>
> That will be a number which is bigger than 2**31 == 2 GB which can't
> be converted to an int.
>
> It would be explained if zinfo.compress_size is > 2GB, eg
>
>   >>> f=open("z")
>   >>> f.read(2**31)
>   Traceback (most recent call last):
>     File "<stdin>", line 1, in ?
>   OverflowError: long int too large to convert to int
>
> However it would seem nuts that zipfile is trying to read > 2GB into
> memory at once!
Perhaps, but that's what the read(name) method does - it returns a string
containing the entire contents of the selected file.  So I think this runs
into a basic limit on the maximum length of a Python string (at least in
32-bit builds; I'm not sure about 64-bit) as much as it is an issue with
the zipfile module.  Of course, the fact that the only "read" method
zipfile offers returns the entire file as one string might be considered a
design flaw.

For the OP, if you know you are going to be dealing with very large files,
you might want to implement your own individual file extraction, since I'm
guessing you don't actually need all 5+GB of the problematic file loaded
into memory in a single I/O operation, particularly if you're just going
to write it out again, which is what your original forum code was doing.

I'd probably suggest just using the getinfo(name) method to return the
ZipInfo object for the file in question, then processing the appropriate
section of the zip file directly - seek to the proper offset, then read
the data incrementally up to the full size given by the ZipInfo
compress_size attribute.  If the file is compressed, you can hand each
chunk to the decompressor before doing any other processing.

E.g., instead of your original:

    fileData = dataObj.read(i)
    fileHndl = file(fileName,"wb")
    fileHndl.write(fileData)
    fileHndl.close()

something like (untested):

    import struct
    import zlib
    from zipfile import ZIP_DEFLATED

    CHUNK = 65536               # I/O chunk size

    fileHndl = file(fileName,"wb")

    zinfo = dataObj.getinfo(i)
    compressed = (zinfo.compress_type == ZIP_DEFLATED)
    if compressed:
        # negative wbits => raw deflate stream, as in zipfile.py
        dc = zlib.decompressobj(-15)

    # Skip the 30-byte fixed local file header plus the variable-length
    # filename and extra field that precede the actual file data.
    dataObj.fp.seek(zinfo.header_offset)
    localhdr = dataObj.fp.read(30)
    fname_len, extra_len = struct.unpack("<HH", localhdr[26:30])
    dataObj.fp.seek(zinfo.header_offset + 30 + fname_len + extra_len)

    remain = zinfo.compress_size
    while remain:
        bytes = dataObj.fp.read(min(remain, CHUNK))
        remain -= len(bytes)
        if compressed:
            bytes = dc.decompress(bytes)
        fileHndl.write(bytes)

    if compressed:
        # feed in an unused pad byte so zlib flushes the final block
        # (the same trick zipfile.py's own read() uses)
        bytes = dc.decompress('Z') + dc.flush()
        if bytes:
            fileHndl.write(bytes)

    fileHndl.close()

Note that the above assumes you are only reading from the zip file, since
it doesn't maintain the current read() method's invariant of leaving the
file pointer position unchanged - but you could add that too.  You could
also verify the file's CRC along the way if you wanted to.

It might be even better to turn the above into a generator, perhaps as a
new method on a local ZipFile subclass.  Use the above as a read_gen
method with the write() calls replaced by "yield bytes", and your outer
code could look like:

    fileHndl = file(fileName,"wb")
    for bytes in dataObj.read_gen(i):
        fileHndl.write(bytes)
    fileHndl.close()
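For illustration, a rough sketch of what that generator approach might
look like as a ZipFile subclass (also untested; the BigZipFile class, the
read_gen name and the struct-based parsing of the local file header are
just choices for this sketch, not anything the zipfile module provides):

    import struct
    import zlib
    import zipfile

    class BigZipFile(zipfile.ZipFile):

        def read_gen(self, name, chunk=65536):
            """Yield the contents of 'name' a chunk at a time."""
            zinfo = self.getinfo(name)
            compressed = (zinfo.compress_type == zipfile.ZIP_DEFLATED)
            if compressed:
                dc = zlib.decompressobj(-15)    # raw deflate stream

            # Position just past the local file header (30 fixed bytes
            # plus the variable-length filename and extra field).
            self.fp.seek(zinfo.header_offset)
            localhdr = self.fp.read(30)
            fname_len, extra_len = struct.unpack("<HH", localhdr[26:30])
            self.fp.seek(zinfo.header_offset + 30 + fname_len + extra_len)

            remain = zinfo.compress_size
            while remain:
                bytes = self.fp.read(min(remain, chunk))
                remain -= len(bytes)
                if compressed:
                    bytes = dc.decompress(bytes)
                if bytes:
                    yield bytes

            if compressed:
                # pad byte so zlib flushes the final block, same trick
                # zipfile.py's own read() uses
                bytes = dc.decompress('Z') + dc.flush()
                if bytes:
                    yield bytes

The same caveats apply as above: it is read-only and doesn't restore the
zip file's read position afterwards.

-- David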