David Bolen <[EMAIL PROTECTED]> writes:

> If you are going to read the file data incrementally from the zip file
> (which is what my other post provided) you'll prevent the huge memory
> allocations and risk of running out of resource, but would have to
> implement your own line ending support if you then needed to process
> that data in a line-by-line mode.  Not terribly hard, but more
> complicated than my prior sample which just returned raw data chunks.
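[For reference, later Pythons (2.6 and up) expose this incremental behavior directly: ZipFile.open() returns a file-like object over a member that can be iterated line by line without decompressing the whole member into memory. A minimal sketch, with invented file and member names:]

```python
import io
import zipfile

# Build a small zip in memory for demonstration (names are made up).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('somefilename.txt', 'line one\nline two\nline three\n')

# ZipFile.open() reads and decompresses the member incrementally, so
# iterating it yields one line (as bytes, endings included) at a time.
with zipfile.ZipFile(buf) as zf:
    with zf.open('somefilename.txt') as f:
        for line in f:
            print(line.rstrip(b'\r\n'))
```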
Here's a small example of a ZipFile subclass (tested a bit this time)
that implements two generator methods:

    read_generator        Yields raw data from the file
    readline_generator    Yields "lines" from the file (per splitlines)

It also corrects my prior code posting, which didn't really skip over
the file header properly (due to the variable-sized name/extra fields).

Needs Python 2.3+ for generator support (or 2.2 with a __future__
import).

Peak memory use is set "roughly" by the optional chunk parameter.
Roughly, because chunk controls how much compressed data is read at a
time, so the data will grow in memory during decompression.  The
readline generator adds further copies when the data is split into
lines.

For your file processing by line, it could be used as in:

    zipf = ZipFileGen('somefile.zip')

    g = zipf.readline_generator('somefilename.txt')
    for line in g:
        dealwithline(line)

    zipf.close()

Even if not a perfect match, it should point you further in the right
direction.

-- David

- - - - - - - - - - - - - - - - - - - - - - - - -

import zipfile
import zlib
import struct


class ZipFileGen(zipfile.ZipFile):

    def read_generator(self, name, chunk=65536):
        """Return a generator that yields file bytes for name incrementally.

        The optional chunk parameter controls the chunk size read from
        the underlying zip file.  For compressed files, the data yielded
        by the generator will generally be larger, since it is the
        decompressed form of each compressed chunk.

        Note that unlike read(), this method does not preserve the
        internal file pointer and should not be mixed with write
        operations.  Nor does it verify that the ZipFile is still open
        for reading.
        Multiple generators returned by this function are not designed
        to be used simultaneously (they do not re-seek the underlying
        file for each request)."""

        zinfo = self.getinfo(name)
        compressed = (zinfo.compress_type == zipfile.ZIP_DEFLATED)
        if compressed:
            dc = zlib.decompressobj(-15)

        self.fp.seek(zinfo.header_offset)

        # Skip the file header (from zipfile.ZipFile.read()), including
        # the variable-sized filename and extra fields
        fheader = self.fp.read(30)
        if fheader[0:4] != zipfile.stringFileHeader:
            raise zipfile.BadZipfile, "Bad magic number for file header"

        fheader = struct.unpack(zipfile.structFileHeader, fheader)
        fname = self.fp.read(fheader[zipfile._FH_FILENAME_LENGTH])
        if fheader[zipfile._FH_EXTRA_FIELD_LENGTH]:
            self.fp.read(fheader[zipfile._FH_EXTRA_FIELD_LENGTH])

        # Process the file incrementally
        remain = zinfo.compress_size
        while remain:
            bytes = self.fp.read(min(remain, chunk))
            remain -= len(bytes)
            if compressed:
                bytes = dc.decompress(bytes)
            yield bytes

        if compressed:
            # Flush any data held back by the decompressor ('Z' is a
            # dummy byte, as in zipfile.ZipFile.read())
            bytes = dc.decompress('Z') + dc.flush()
            if bytes:
                yield bytes

    def readline_generator(self, name, chunk=65536):
        """Return a generator that yields lines from a file within the
        zip incrementally.

        Line-ending detection is based on splitlines(), and like
        file.readline(), the returned lines do not include the line
        ending.  Efficiency is not guaranteed if used with non-textual
        files.
        Uses a read_generator() generator to retrieve the file data
        incrementally, so it inherits the limitations of that method as
        well, and the optional chunk parameter is passed to
        read_generator unchanged."""

        partial = ''
        g = self.read_generator(name, chunk=chunk)

        for bytes in g:
            # Break the current chunk into lines
            lines = bytes.splitlines()

            # Add any prior partial line to the first line
            if partial:
                lines[0] = partial + lines[0]
                partial = ''

            # If the current chunk didn't happen to break on a line
            # ending, save the partial line for next time
            if bytes[-1] not in ('\n', '\r'):
                partial = lines.pop()

            # Then yield the lines we've identified so far
            for curline in lines:
                yield curline

        # Return any trailing data (if the file didn't end in a line ending)
        if partial:
            yield partial

--
http://mail.python.org/mailman/listinfo/python-list
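[The partial-line buffering that readline_generator() performs is independent of zip files and works over any source of chunks. A self-contained sketch of the same technique in modern Python, with the helper name lines_from_chunks invented for illustration:]

```python
def lines_from_chunks(chunks):
    """Yield complete lines (endings stripped, per splitlines) from an
    iterable of string chunks, buffering any partial line between
    chunks."""
    partial = ''
    for data in chunks:
        if not data:
            continue
        lines = data.splitlines()
        # Prepend any incomplete line held over from the last chunk
        if partial:
            lines[0] = partial + lines[0]
            partial = ''
        # A chunk rarely ends exactly on a line ending; hold the
        # incomplete tail until the next chunk arrives
        if data[-1] not in ('\n', '\r'):
            partial = lines.pop()
        for line in lines:
            yield line
    # Trailing data from a file with no final line ending
    if partial:
        yield partial

# Lines survive being split across chunk boundaries:
print(list(lines_from_chunks(['ab', 'c\nde\nf', 'g\n'])))
# → ['abc', 'de', 'fg']
```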