On Thu, 12 Dec 2013 13:54:10 +0100, Peter Otten wrote:

> Steven D'Aprano wrote:
>
>> In any case, sorting in Python is amazingly fast. You may be pleasantly
>> surprised that a version that sorts your data, while nominally
>> O(N log N), may be much faster than an O(N) solution that doesn't
>> require sorted data. If I were a betting man, I'd be willing to wager
>> a shiny new dollar[1] that sorting works out faster for reasonable
>> sized sets of data.
>
> Well, that was my first reaction, too. But then
[snip timeit results]

On my system, I don't get quite the same results as you. I find that the
groupby solution is slightly faster than the dict solution, although both
are slightly faster than the version I came up with. All three are plenty
fast enough, though. Using the same function definitions as you, I get:

py> from timeit import Timer
py> setup = "from __main__ import time_dict, time_groupby, time_daprano"
py> t1 = Timer("_ = time_dict()", setup)
py> t2 = Timer("_ = time_groupby()", setup)
py> t3 = Timer("_ = time_daprano()", setup)

and results:

py> min(t1.repeat(number=100000, repeat=6))
6.7522243447601795
py> min(t2.repeat(number=100000, repeat=6))
6.7246054373681545
py> min(t3.repeat(number=100000, repeat=6))
7.116707382723689

But there's a flaw in the above analysis: the list that you give is
pre-sorted, and Python's sort routine is *very* efficient at sorting
already-sorted lists. So I randomly shuffled it and redid the tests. Now
the advantage of the dict solution is clear:

py> from random import shuffle
py> shuffle(a)
py> min(t1.repeat(number=100000, repeat=6))
6.611802507191896
py> shuffle(a)
py> min(t2.repeat(number=100000, repeat=6))
8.499449279159307
py> shuffle(a)
py> min(t3.repeat(number=100000, repeat=6))
9.61335832811892

The dict solution is hardly changed (slightly faster, although I guess
that is just due to random timing fluctuations and not meaningful). The
other two, on the other hand, are quite a bit slower with unsorted data.
Nevertheless, it should be pointed out that even in the worst case it's
still pretty damn fast: less than 0.0001 s to process a list of 78 items,
or about 1.3 microseconds per item.

A couple of other observations:

- The collect() function is the same algorithm as the minmax_groupby()
  function; the only difference is the implementation. I would expect
  (but haven't tested) that the collect() version, using a deque instead
  of manually iterating over the items in each group, would be faster if
  there were a lot of items per group.

- The dict version has the advantage that hashing integers is fast. If
  hashing the keys is slower, it may no longer be quite so good.

- The dict version keeps a list of all the values it sees, including
  duplicates. Instead of a list, a set may be sufficient:

      import collections

      def minmax_dict(items):
          d = collections.defaultdict(set)
          for key, value in items:
              d[key].add(value)
          for key, values in d.items():
              yield key, min(values), max(values)

It might even be worthwhile experimenting with something which simply
stores the minimum and maximum values seen, rather than all the values.
If there are a lot of items in each group, or if the comparisons are
expensive, calculating the min and max in a single pass may be
worthwhile:

http://code.activestate.com/recipes/577916-fast-minmax-function/


-- 
Steven
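
For readers of this excerpt who don't have the snipped functions to hand,
the reason the groupby-style solutions suffer on shuffled input is that
itertools.groupby() only merges *adjacent* equal keys, so the data has to
be sorted first. A generic sketch of that shape (the function name and the
(key, value) input format are my own illustration, not code from the
thread):

    import itertools
    import operator

    def minmax_by_key_groupby(items):
        # items is an iterable of (key, value) pairs.  groupby() only
        # merges adjacent equal keys, so sort by key first; that sort is
        # what makes shuffled input more expensive.
        by_key = operator.itemgetter(0)
        for key, group in itertools.groupby(sorted(items, key=by_key),
                                            key=by_key):
            values = [value for _, value in group]
            yield key, min(values), max(values)

For example, minmax_by_key_groupby([(1, 5), (2, 3), (1, 2)]) should yield
(1, 2, 5) and then (2, 3, 3).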
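
And a minimal sketch of the closing suggestion, keeping only the running
minimum and maximum per key instead of accumulating every value; again
the name is made up for illustration, and this isn't the ActiveState
recipe, just the same single-pass idea applied per key:

    def minmax_by_key_running(items):
        # Map each key to a [smallest, largest] pair, updated in one
        # pass, so memory use stays at two values per key regardless of
        # group size.
        extremes = {}
        for key, value in items:
            pair = extremes.get(key)
            if pair is None:
                extremes[key] = [value, value]
            elif value < pair[0]:
                pair[0] = value
            elif value > pair[1]:
                pair[1] = value
        for key, (lo, hi) in extremes.items():
            yield key, lo, hi

It should produce the same (key, min, max) triples as the minmax_dict()
version above, while holding only two values per key and doing at most
two comparisons per item, which is the kind of saving that matters when
comparisons are expensive.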