New submission from Skip Montanaro:

I've had the opportunity to use the seek() method of the gzip.GzipFile class 
for the first time in the past few days. Wondering why it seemed my processing 
times were so slow, I took a look at the code for seek() and read(). It seems 
like the chunk size for reading (1024 bytes) is rather small. I created a 
simple subclass that overrode just seek() and read(), then defined a CHUNK_SIZE 
to be 16 * 8192 bytes (the whole idea of compressing files is that they get 
large, right? It seems like most of the time we will want to seek pretty far 
through the file).
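
For readers without the attachment, here is a minimal sketch of the idea. It 
is not the attached gzipseek.py. It assumes that a forward seek on a gzip 
stream opened for reading amounts to reading and discarding decompressed data, 
and it makes the read size a parameter. The names skip_forward and make_sample 
are hypothetical helpers invented for this illustration; CHUNK_SIZE uses the 
value given above.

```python
import gzip
import io

CHUNK_SIZE = 16 * 8192  # value from the report above


def skip_forward(fileobj, count, chunk_size=CHUNK_SIZE):
    """Read and discard up to `count` bytes in pieces of at most `chunk_size`.

    Returns the number of bytes actually skipped, which is smaller than
    `count` if the stream ends first.
    """
    remaining = count
    while remaining > 0:
        data = fileobj.read(min(chunk_size, remaining))
        if not data:  # end of stream reached before `count` bytes
            break
        remaining -= len(data)
    return count - remaining


def make_sample(n):
    """Build an in-memory gzip stream of about `n` bytes of synthetic data."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(bytes(range(256)) * (n // 256))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    with gzip.GzipFile(fileobj=make_sample(1 << 20)) as gz:
        skipped = skip_forward(gz, 500000)
        print("skipped", skipped, "bytes; next byte is", gz.read(1))
```

A larger chunk_size means fewer read() calls for the same distance skipped, 
which is the effect the subclass is after.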

Over a small subset of my inputs, I measured about a 2x decrease in run times, 
from about 54s to 26s. I ran using both gzip.GzipFile and my subclass several 
times, measuring the last four runs (two using the stdlib implementation, two 
using my subclass). I measured the total time of the run, the time to 
process each input record, and the time to execute just the seek() call for each 
record. The bulk of the per-record time was in the call to seek(), so by 
reducing that time, I sped up my run-times significantly.
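
A rough way to repeat this kind of measurement on synthetic data is sketched 
below. It is not the harness used for the numbers above. time_skip is a 
hypothetical helper, the payload is made up, and the sketch times only a 
read-and-discard loop standing in for the seek() call. It makes no claim about 
what the timings will look like on any particular Python version.

```python
import gzip
import io
import time


def time_skip(compressed, offset, chunk_size):
    """Return seconds spent reading and discarding `offset` decompressed bytes."""
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as gz:
        start = time.perf_counter()
        remaining = offset
        while remaining > 0:
            data = gz.read(min(chunk_size, remaining))
            if not data:  # hit end of stream early
                break
            remaining -= len(data)
        return time.perf_counter() - start


if __name__ == "__main__":
    # About 10 MB of synthetic, highly compressible data (an assumption;
    # real inputs will compress and decompress differently).
    payload = gzip.compress(bytes(range(256)) * 40000)
    for size in (1024, 16 * 8192):
        print(size, time_skip(payload, 10000000, size))
```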

I'm still using 2.7, but other than the usual 2.x->3.x changes, the code looks 
pretty much the same between 2.7 and (at least) 3.3, and the logic involving 
the read size doesn't seem to have changed at all.

I'll try to produce a patch if I have a few minutes, but in the meantime, I've 
attached my modified GzipFile class (produced against 2.7).

----------
components: Library (Lib)
files: gzipseek.py
messages: 213883
nosy: skip.montanaro
priority: normal
severity: normal
status: open
title: Rather modest chunk size in gzip.GzipFile
type: performance
versions: Python 3.4, Python 3.5
Added file: http://bugs.python.org/file34466/gzipseek.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue20962>
_______________________________________