Nir <n...@winpdb.org> added the comment:

First patch, so please forgive the long comment :)

I am submitting a small patch which speeds up readline() on my data set - a 74MB (5MB .gz) log file with 600K lines - by 350%.

The source of the slowness is that the (~20KB) extrabuf is allocated and deallocated in read() and _unread() with every call to readline(). In the patch, read() returns a slice from extrabuf and defers the manipulation of extrabuf to _read() (see the first sketch below). In the following, the first timeit() corresponds to reading slices from extrabuf, while the second corresponds to read() and _unread() as they are done today:

>>> timeit.Timer("x[10000: 10100]", "x = 'x' * 20000").timeit()
0.25299811363220215
>>> timeit.Timer("x[: 100]; x[100:]; x[100:] + x[: 100]", "x = 'x' * 10000").timeit()
5.843876838684082

Another speedup is gained by a small shortcut in readline() for the typical case in which the entire line is already present in extrabuf (see the second sketch below).

The patch only addresses the typical case of calling readline() with no arguments. It does not address other problems in the readline() logic. In particular, the current 512 chunk size is not a sweet spot: regardless of the size argument passed to readline(), read() will continue to decompress just 1024 bytes with each call, since the size of extrabuf swings around the target size argument as a result of the interaction between _unread() and read() (see the third sketch below).
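To make the copy pattern concrete, here is a minimal sketch of the two strategies. The class names are invented for illustration, and the method bodies paraphrase the gzip.py logic from memory - they are not excerpts from the module or from the patch:

class CopyingBuffer(object):
    # Mimics today's read()/_unread(): extrabuf is rebuilt on every call.
    def __init__(self, data):
        self.extrabuf = data

    def read(self, size):
        chunk = self.extrabuf[:size]            # copy 1: the requested bytes
        self.extrabuf = self.extrabuf[size:]    # copy 2: the remainder
        return chunk

    def unread(self, buf):
        self.extrabuf = buf + self.extrabuf     # copy 3: concatenation

class SlicingBuffer(object):
    # Mimics the patched approach: a moving offset into an intact buffer.
    def __init__(self, data):
        self.extrabuf = data
        self.offset = 0

    def read(self, size):
        chunk = self.extrabuf[self.offset: self.offset + size]  # one slice
        self.offset += size
        return chunk

    def unread(self, buf):
        self.offset -= len(buf)                 # move the cursor back; no copying

With ~20KB sitting in extrabuf, the three copies per readline() dominate the cost; the slicing variant only touches the bytes it actually returns.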
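The readline() shortcut amounts to checking for a newline in the already-decompressed data before doing any read()/_unread() round trip. A sketch, written against the toy SlicingBuffer above (the real patch works on GzipFile's own attributes):

def readline_fast(buf):
    # Typical case: the whole line is already decompressed in extrabuf.
    i = buf.extrabuf.find('\n', buf.offset)
    if i < 0:
        return None                  # not buffered; fall back to the slow path
    line = buf.extrabuf[buf.offset: i + 1]
    buf.offset = i + 1
    return line

>>> b = SlicingBuffer('first\nsecond\n')
>>> readline_fast(b)
'first\n'
>>> readline_fast(b)
'second\n'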
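To see why read() stays stuck at 1024-byte decompression steps, here is a toy simulation of the loop inside read(). The loop shape and the 1024 starting value paraphrase trunk gzip.py from memory; the max_read_chunk value is an assumption, and _read() here is a counting stand-in rather than real decompression:

class Sim(object):
    max_read_chunk = 10 * 1024 * 1024    # assumed cap, as in trunk gzip.py

    def __init__(self):
        self.extrasize = 0
        self.per_call = []               # bytes "decompressed" per read() call

    def _read(self, n):
        self.extrasize += n              # stand-in: just grow the buffer
        self.per_call[-1] += n

    def read(self, size):
        self.per_call.append(0)
        readsize = 1024                  # restarts at 1024 on *every* call
        while size > self.extrasize:
            self._read(readsize)
            readsize = min(self.max_read_chunk, readsize * 2)
        self.extrasize -= size           # hand back `size` bytes

s = Sim()
for _ in range(6):
    s.read(512)                          # readline()-sized requests
print(s.per_call)                        # [1024, 0, 1024, 0, 1024, 0]

Because readsize restarts at 1024 on every call, and _unread() keeps extrasize hovering just below the next request, the doubling inside the loop never compounds across readline() calls.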
----------
keywords: +patch
nosy: +nirai
Added file: http://bugs.python.org/file15536/gzip_7471_patch.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue7471>
_______________________________________
I submit a small patch which speeds up readline() on my data set - a 74MB (5MB .gz) log file with 600K lines. The speedup is 350%. Source of slowness is that (~20KB) extrabuf is allocated/deallocated in read() and _unread() with each call to readline(). In the patch read() returns a slice from extrabuf and defers manipulation of extrabuf to _read(). In the following, the first timeit() corresponds to reading extrabuf slices while the second timeit() corresponds to read() and _unread() as they are done today: >>> timeit.Timer("x[10000: 10100]", "x = 'x' * 20000").timeit() 0.25299811363220215 >>> timeit.Timer("x[: 100]; x[100:]; x[100:] + x[: 100]", "x = 'x' * 10000").timeit() 5.843876838684082 Another speedup is achieved by doing a small shortcut in readline() for the typical case in which the entire line is already in extrabuf. The patch only addresses the typical case of calling readline() with no arguments. It does not address other problems in readline() logic. In particular the current 512 chunk size is not a sweet spot. Regardless of the size argument passed to readline(), read() will continue to decompress just 1024 bytes with each call as the size of extrabuf swings around the target size argument as result of the interaction between _unread() and read(). ---------- keywords: +patch nosy: +nirai Added file: http://bugs.python.org/file15536/gzip_7471_patch.diff _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue7471> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com