On 2009-01-09, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> On Fri, 09 Jan 2009 15:34:17 +0000, MRAB wrote:
>
>> Marc 'BlackJack' Rintsch wrote:
>>
>>>     def iter_max_values(blocks, block_count):
>>>         for i, block in enumerate(blocks):
>>>             histogram = defaultdict(int)
>>>             for byte in block:
>>>                 histogram[byte] += 1
>>>
>>>             yield max((count, value)
>>>                       for value, count in histogram.iteritems())[1]
>>>
>> [snip]
>> Would it be faster if histogram was a list initialised to [0] * 256?
>
> Don't know.  Then for every byte in the 2 GiB we have to call `ord()`.
> Maybe the speedup from the list compensates this, maybe not.
>
> I think that having to do something with *every* byte of that really
> large file *at Python level* is the main problem here.  In C that's
> just some primitive numbers.  Python has all the object overhead.
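For concreteness, here is a minimal sketch of the list-based variant
MRAB asks about (this is not code from the thread; the function name
is made up, and blocks are assumed to be Python 2 byte strings, so
each byte needs an ord() call to become a list index):

    def iter_max_values_list(blocks):
        # Hypothetical variant: a plain list of 256 counters instead
        # of a defaultdict.  Indexing the list requires ord(byte),
        # which is the extra per-byte call Marc worries about.
        for block in blocks:
            histogram = [0] * 256
            for byte in block:        # byte is a 1-char str in Python 2
                histogram[ord(byte)] += 1
            # The index of the largest counter is the most common
            # byte value in this block.
            yield histogram.index(max(histogram))

Whether O(1) list indexing beats the defaultdict's hashing despite the
extra ord() call is exactly the open question here; only timing both
on real data would settle it.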
Using buffers or arrays of bytes instead of strings/lists would
probably reduce the overhead quite a bit.

--
Grant Edwards
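A rough sketch of the direction Grant suggests, pushing the per-byte
work out of the Python loop entirely (again not from the thread; it
assumes blocks yields Python 2 byte strings and relies only on
str.count(), which scans the block at C speed):

    def iter_max_values_counted(blocks):
        # Hypothetical buffer-style variant: instead of touching each
        # byte from Python, let str.count() scan the block in C once
        # per possible byte value.  256 C-speed passes over a block
        # are typically much cheaper than one Python-speed pass.
        for block in blocks:
            counts = [(block.count(chr(value)), value)
                      for value in range(256)]
            yield max(counts)[1]

The same idea carries over to array.array or bytearray objects: the
point is that the loop over individual bytes runs in C, so Python's
object overhead is paid 256 times per block instead of once per byte.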