On Jan 9, 2:14 pm, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
> On Fri, 09 Jan 2009 15:34:17 +0000, MRAB wrote:
> > Marc 'BlackJack' Rintsch wrote:
> >
> >> def iter_max_values(blocks, block_count):
> >>     for i, block in enumerate(blocks):
> >>         histogram = defaultdict(int)
> >>         for byte in block:
> >>             histogram[byte] += 1
> >>
> >>         yield max((count, value)
> >>                   for value, count in histogram.iteritems())[1]
> >
> > [snip]
> > Would it be faster if histogram was a list initialised to [0] * 256?
>
> Don't know.  Then for every byte in the 2 GiB we have to call `ord()`.
> Maybe the speedup from the list compensates this, maybe not.
>
> I think that having to do something with *every* byte of that really
> large file *at Python level* is the main problem here.  In C that's
> just some primitive numbers.  Python has all the object overhead.
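For comparison, here's a rough sketch of the list-based variant MRAB suggests, written in modern Python 3 syntax (where iterating over a bytes object already yields ints, so the `ord()` call the quoted reply worries about disappears). The sample `blocks` data is made up for illustration; note that on a tie this version picks the lowest byte value, whereas the quoted `max((count, value) ...)` form would pick the highest.

```python
def iter_max_values(blocks):
    # Index a plain 256-slot list by byte value instead of hashing each
    # byte into a defaultdict.
    for block in blocks:
        histogram = [0] * 256
        for byte in block:          # ints 0..255 in Python 3
            histogram[byte] += 1
        # Yield the byte value with the highest count (ties -> lowest value).
        yield max(range(256), key=histogram.__getitem__)

blocks = [b"aabbc", b"zzzy"]        # toy data, stands in for 2 GiB of blocks
print(list(iter_max_values(blocks)))
```

The inner loop still touches every byte at Python level, though, so this only shaves off the dict-hashing and `ord()` overhead; it doesn't address the per-byte interpreter cost the reply identifies as the real problem.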
struct's `B` format might help here.  Also, `struct.unpack_from` could
probably be combined with `mmap` to avoid copying the input.  Not to
mention that the 0..256 ints are all cached by CPython and won't be
allocated/deallocated.

--
http://mail.python.org/mailman/listinfo/python-list