Johannes Bauer <dfnsonfsdu...@gmx.de> writes:

> Yup, I changed the Python code to behave the same way the C code did -
> however overall it's not much of an improvement: Takes about 15 minutes
> to execute (still factor 23).

Not sure this is completely fair if you're only looking for a pure Python solution, but to be honest, looping through a gazillion individual bytes of information sort of begs for offloading that work to a library that can execute it faster, while keeping the convenience of Python outside of the pure number crunching.

I'd assume numeric/numpy might have applicable functions, but I don't use those libraries much, whereas I've been using OpenCV recently for a lot of image processing work, and it has matrix/histogram support, which seems to be a good match for your needs.

For example, assuming the OpenCV library and ctypes-opencv wrapper, add the following before the file I/O loop:

    from opencv import *

    # Histogram for each file chunk
    hist = cvCreateHist([256], CV_HIST_ARRAY, [(0,256)])

then, replace (using one of your posted methods as a sample):

    datamap = { }
    for i in data:
        datamap[i] = datamap.get(i, 0) + 1
    array = sorted([(b, a) for (a, b) in datamap.items()], reverse=True)
    most = ord(array[0][1])

with:

    matrix = cvMat(1, len(data), CV_8UC1, data)
    cvCalcHist([matrix], hist)
    most = cvGetMinMaxHistValue(hist,
                                min_val = False, max_val = False,
                                min_idx = False, max_idx = True)

That should give you your results in a fraction of the time. I didn't run with a full size data file, but for a smaller one using smaller chunks the OpenCV variant ran in about 1/10 of the time, and that was while leaving all the other remaining Python code in place.

Note that the results may not be identical to those of some of your other methods when multiple values have the same count. The OpenCV histogram min/max call will always pick the lower value in such cases, whereas some of your code (such as above) will pick the upper value, and your original code depended on the order of information returned by dict.items.
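
For what it's worth, the numpy route mentioned above might look roughly like the sketch below. I haven't timed it against your data, so treat it as an illustration only. It assumes data is the byte string read for one chunk, and most_common_byte is just a name made up for this example.

```python
import numpy as np

def most_common_byte(data):
    """Return the most frequent byte value (0-255) in a bytes-like chunk.

    Hypothetical helper for illustration; not part of the original code.
    """
    # View the raw bytes as unsigned 8-bit integers without copying.
    arr = np.frombuffer(data, dtype=np.uint8)
    # Count how often each value occurs; minlength keeps all 256 bins.
    counts = np.bincount(arr, minlength=256)
    # argmax returns the first (lowest) index when several counts tie.
    return int(counts.argmax())
```

Inside the file loop this would be used as most = most_common_byte(data). On ties it picks the lower byte value, which matches the OpenCV behaviour described above rather than the sorted-list version.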

This sort of small, dedicated, high-performance choke point is probably also perfect for something like Pyrex/Cython, although that would require a compiler to build the extension for the histogram code.

-- David

-- 
http://mail.python.org/mailman/listinfo/python-list