On Fri, 09 Jan 2009 15:34:17 +0000, MRAB wrote:

> Marc 'BlackJack' Rintsch wrote:
>> On Fri, 09 Jan 2009 04:04:41 +0100, Johannes Bauer wrote:
>>
>>> As this was horribly slow (20 Minutes for a 2GB file) I coded the whole
>>> thing in C also:
>>
>> Yours took ~37 minutes for 2 GiB here.  This "just" ~15 minutes:
>>
>> #!/usr/bin/env python
>> from __future__ import division, with_statement
>> import os
>> import sys
>> from collections import defaultdict
>> from functools import partial
>> from itertools import imap
>>
>>
>> def iter_max_values(blocks, block_count):
>>     for i, block in enumerate(blocks):
>>         histogram = defaultdict(int)
>>         for byte in block:
>>             histogram[byte] += 1
>>
>>         yield max((count, value)
>>                   for value, count in histogram.iteritems())[1]
>>
> [snip]
> Would it be faster if histogram was a list initialised to [0] * 256?
I tried it on my computer, also getting the character codes with
struct.unpack, like this:

    histogram = [0] * 256

    for byte in struct.unpack('%dB' % len(block), block):
        histogram[byte] += 1

    yield max((count, idx)
              for idx, count in enumerate(histogram))[1]

and I also removed the map( ord ... ) call in the main program, since
iter_max_values now returns character codes directly. The result is 10
minutes against the 13 of BlackJack's original code on my PC (iMac Intel,
Python 2.6.1).

Strangely, using

    histogram = array.array('i', [0] * 256)

gives 13 minutes again, even if I create the array outside the loop and
then use

    histogram[:] = zero_array

to reset the values.

Ciao
-----
FB
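P.S. In case anyone wants to reproduce the list vs. array.array comparison
on a small scale, here is a minimal, self-contained sketch of the two
histogram variants described above. The function names max_byte_list and
max_byte_array and the 16 KiB test block are made up for illustration and
are not part of BlackJack's script. The timing loop gives only a rough
indication and will differ between machines and Python versions.

```python
import array
import struct


def max_byte_list(block):
    """Return the code of the most frequent byte in block (list histogram)."""
    histogram = [0] * 256
    # struct.unpack yields integer byte codes on both Python 2 and 3.
    for byte in struct.unpack('%dB' % len(block), block):
        histogram[byte] += 1
    # Pair each count with its index so max() returns the byte code;
    # ties go to the higher byte code.
    return max((count, code) for code, count in enumerate(histogram))[1]


def max_byte_array(block):
    """Same computation, with an array.array histogram instead of a list."""
    histogram = array.array('i', [0] * 256)
    for byte in struct.unpack('%dB' % len(block), block):
        histogram[byte] += 1
    return max((count, code) for code, count in enumerate(histogram))[1]


if __name__ == '__main__':
    import timeit

    # Hypothetical test data: every byte value repeated 64 times (16 KiB).
    block = bytes(bytearray(range(256)) * 64)
    for func in (max_byte_list, max_byte_array):
        seconds = timeit.timeit(lambda: func(block), number=50)
        print('%s: %.3f s' % (func.__name__, seconds))
```

Both functions pair the count with the histogram index, so the value
returned is the byte code itself and no ord() call is needed afterwards.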