"Barak, Ron" <ron.ba...@lsi.com> writes: > I thought maybe someone has a way to unzip just the end portion of > the archive (instead of the whole archive), as only the last part is > needed for reading the last line.
The problem is that gzip compressed output has no reliable intermediate break points that you can jump to and just start decompressing without having worked through the prior data.

In your specific code, using readlines() is probably not ideal, as it will create the full list of all of the decoded file contents in memory only to let you pick the last one. So a small optimization would be to just iterate through the file (directly or by calling readline()) until you reach the last line.

However, since you don't care about the bulk of the file, but only need to work with the final line, this is an activity that could be handled more efficiently with external tools, since you needn't spend much interpreter time decompressing and discarding the bulk of the file. For example, on my system, comparing these two cases:

# last.py
import gzip
import sys

in_file = gzip.open(sys.argv[1], 'r')
for line in in_file:
    pass
print 'Last:', line

# last-popen.py
import sys
from subprocess import Popen, PIPE

# Implement gzip -dc <file> | tail -1
gzip = Popen(['gzip', '-dc', sys.argv[1]], stdout=PIPE)
tail = Popen(['tail', '-1'], stdin=gzip.stdout, stdout=PIPE)
line = tail.communicate()[0]
print 'Last:', line

with an ~80MB log file compressed to about 8MB, last.py took about 26 seconds, while last-popen.py took about 1.7s. Both resulted in the same value in "line". As long as you have local binaries for gzip/tail (such as from Cygwin or MinGW or equivalent), this works fine on Windows systems too.

If you really want to keep everything in Python, then I'd suggest working to optimize the "skip" portion of the task, trying to decompress the bulk of the file as quickly as possible. For example, one possibility would be something like:

# last-chunk.py
import gzip
import sys
from cStringIO import StringIO

in_file = gzip.open(sys.argv[1], 'r')

chunks = ['', '']
while 1:
    chunk = in_file.read(1024*1024)
    if not chunk:
        break
    del chunks[0]
    chunks.append(chunk)

data = StringIO(''.join(chunks))
for line in data:
    pass
print 'Last:', line

with the idea that you decode about a MB at a time, holding onto the final two chunks (in case the actual final chunk turns out to be smaller than one of your lines), and then only process those two chunks for lines.

There's probably some room for tweaking the mechanism for holding onto just the last two chunks, but I'm not sure it will make a major difference in performance. In the same environment as the earlier tests, the above took about 2.7s. So it's still much slower than the external utilities in percentage terms, but in absolute terms an extra second or so may be an acceptable price for keeping everything in pure Python.

-- David
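P.S. One possible tidying of that chunk-holding mechanism, if you're on Python 2.6 or later, is collections.deque with maxlen=2, which discards the oldest chunk automatically as new ones are appended. A minimal sketch along the lines of last-chunk.py above (I haven't timed this variant, so take it as illustrating the bookkeeping rather than a measured improvement):

# last-deque.py - last-chunk.py with deque-based chunk bookkeeping
# (assumes Python 2.6+, which added the maxlen argument)
import gzip
import sys
from collections import deque
from cStringIO import StringIO

in_file = gzip.open(sys.argv[1], 'r')

# A bounded deque keeps only the two most recent chunks; appending a
# third silently drops the oldest, replacing the manual del/append.
chunks = deque(['', ''], maxlen=2)
while 1:
    chunk = in_file.read(1024*1024)
    if not chunk:
        break
    chunks.append(chunk)

data = StringIO(''.join(chunks))
for line in data:
    pass
print 'Last:', line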