"Barak, Ron" <ron.ba...@lsi.com> writes: > I couldn't really go with the shell utilities approach, as I have no > say in my user environment, and thus cannot assume which binaries > are install on the user's machine.
I suppose if you knew your target you could just supply the external
binaries to go with your application, but I agree that would probably be
more of a pain than it's worth for the performance gain in real-world
time.

> I'll try and implement your last suggestion, and see if the
> performance is acceptable to (human) users.

In terms of tuning the third option a bit, I'd play with the tracking of
the final two chunks (as mentioned in my first response), perhaps
shrinking the chunk size or only processing a smaller portion of the
final chunk for lines (assuming a reasonable line size) to minimize the
final loop. You could also try using splitlines() on the final buffer
rather than a StringIO wrapper; that has a memory hit for the
constructed list, but handing it only a small portion of the buffer
would minimize that.

I was curious what I could actually achieve, so here are three variants
that I came up with.

First, this just slightly fine-tunes the tracking of the chunks and then
only processes enough of the final data based on an anticipated maximum
line length (so if the final line is longer than that you'll only get
the final MAX_LINE bytes of that line). I also found I got better
performance using a smaller 1024-byte chunk size with GzipFile.read()
than a MB - not entirely sure why, although it perhaps matches the
internal buffer size better:

# last-chunk-2.py
import gzip
import sys

CHUNK_SIZE = 1024
MAX_LINE = 255

in_file = gzip.open(sys.argv[1], 'r')

chunk = prior_chunk = ''
while 1:
    prior_chunk = chunk
    # Note that CHUNK_SIZE here is in terms of decompressed data
    chunk = in_file.read(CHUNK_SIZE)
    if len(chunk) < CHUNK_SIZE:
        break

if len(chunk) < MAX_LINE:
    chunk = prior_chunk + chunk

line = chunk.splitlines(True)[-1]

print 'Last:', line

On the same test set as my last post, this reduced the last-chunk timing
from about 2.7s to about 2.3s.

Now, if you're willing to play a little looser with the gzip module, you
can gain quite a bit more. If you directly call the internal _read()
method, you can bypass some of the unnecessary processing read() does,
and go back to larger I/O chunks:

# last-gzip.py
import gzip
import sys

CHUNK_SIZE = 1024*1024
MAX_LINE = 255

in_file = gzip.open(sys.argv[1], 'r')

chunk = prior_chunk = ''
while 1:
    try:
        # Note that CHUNK_SIZE here is raw data size, not decompressed
        in_file._read(CHUNK_SIZE)
    except EOFError:
        if in_file.extrasize < MAX_LINE:
            chunk = chunk + in_file.extrabuf
        else:
            chunk = in_file.extrabuf
        break

    chunk = in_file.extrabuf
    in_file.extrabuf = ''
    in_file.extrasize = 0

line = chunk[-MAX_LINE:].splitlines(True)[-1]

print 'Last:', line

Note that in this case, since I was able to bump up CHUNK_SIZE, I take a
slice to limit the work splitlines() has to do and the size of the
resulting list. Using the larger CHUNK_SIZE (and it being raw,
compressed size) will use more memory, so it could be tuned down if
necessary.

Of course, the risk here is that you are dependent on the _read()
method, and on the internal use of the extrabuf/extrasize attributes,
which is where _read() places the decompressed data. In looking back I'm
pretty sure this code is safe at least for Python 2.4 through 3.0, but
you'd have to accept some risk in the future.

This approach got me down to 1.48s.
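As an aside on the internals risk mentioned above: you could cheaply
sanity-check that those attributes still exist before taking the fast
path, and fall back to the plain read() loop otherwise. A minimal,
untested sketch (not one of the timed variants; the fallback message is
just illustrative):

# check-internals.py - untested sketch: verify the gzip internals that
# last-gzip.py relies on before using them, and fall back to the plain
# read() loop from last-chunk-2.py if they ever go away.
import gzip
import sys

in_file = gzip.open(sys.argv[1], 'r')

# The fast path needs GzipFile._read() plus the extrabuf/extrasize
# attributes it fills with decompressed data.
has_internals = (hasattr(in_file, '_read') and
                 hasattr(in_file, 'extrabuf') and
                 hasattr(in_file, 'extrasize'))

if not has_internals:
    print >>sys.stderr, 'gzip internals look different; use plain read()'

Of course, hasattr() only catches the attributes disappearing outright,
not their meaning quietly changing, so it's a partial safeguard at best.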
Then, just for the fun of it: once you're already playing a little
looser with the gzip module, note that it's also doing work to compute
the CRC of the original data for comparison with the decompressed data.
If you don't mind so much about that (it depends on what you're using
the line for), you can just do your own raw decompression with the zlib
module, as in the following code, although I still start with a
GzipFile() object to avoid having to rewrite the header processing:

# last-decompress.py
import gzip
import sys
import zlib

CHUNK_SIZE = 1024*1024
MAX_LINE = 255

decompress = zlib.decompressobj(-zlib.MAX_WBITS)

in_file = gzip.open(sys.argv[1], 'r')
in_file._read_gzip_header()

chunk = prior_chunk = ''
while 1:
    buf = in_file.fileobj.read(CHUNK_SIZE)
    if not buf:
        break

    d_buf = decompress.decompress(buf)
    # We might not have been at EOF in the read() but still have no
    # decompressed data if the only remaining data was not original data
    if d_buf:
        prior_chunk = chunk
        chunk = d_buf

if len(chunk) < MAX_LINE:
    chunk = prior_chunk + chunk

line = chunk[-MAX_LINE:].splitlines(True)[-1]

print 'Last:', line

This version got me down to 1.15s.

So in summary, the choices when tested on my system ended up at (times
in seconds):

    last               26
    last-chunk          2.7
    last-chunk-2        2.3
    last-popen          1.7
    last-gzip           1.48
    last-decompress     1.12

So by being willing to mix in some more direct code with the GzipFile
object, I was able to beat the overhead of shelling out to the faster
utilities, while remaining in pure Python.

-- David

--
http://mail.python.org/mailman/listinfo/python-list