Steven D'Aprano added the comment: I've run some performance tests on six variations of the O(N) select algorithm, based on Tim Peters' and Thomas Ahle's code, comparing them to the naive O(N log N) "sort first" algorithm, and sorting is consistently faster up to the limit I tested.
About the tests I ran: - I tested four versions of Tim's median-of-median-of-k algorithm, for k = 7, 23, 47 and 97. - Thomas' "select" function, which is a median-of-median-of-3. - Thomas' "select2" function, which uses two pivots. - Data was randomly shuffled. - Functions were permitted to modify the data in place, and were not required to make a copy of the data first. E.g. I used alist.sort() rather than sorted(alist). - I ran two separate sets of tests. The first tested individual calls to the various selection functions, on random data. Each function got its own copy of the shuffled data. - The second set of tests called the selection function three times in a row, using different ranks, and used the average of the three times. My test suite is attached if anyone wants to critique it or run it themselves. Results: == Single call mode == N sort select7 select23 select47 select97 select select2 -------- -------- -------- -------- -------- -------- -------- -------- 5000 0.001 0.027 0.004 0.003 0.003 0.005 0.002 10000 0.002 0.008 0.006 0.005 0.005 0.007 0.006 50000 0.014 0.041 0.029 0.027 0.028 0.039 0.035 100000 0.035 0.088 0.069 0.065 0.067 0.132 0.067 500000 0.248 0.492 0.352 0.349 0.345 0.378 0.433 1000000 0.551 1.008 0.768 0.669 0.723 1.007 0.627 2000000 1.173 2.004 1.791 1.335 1.376 3.049 1.108 3000000 1.992 3.282 2.291 2.256 2.299 2.451 1.756 4000000 2.576 4.135 3.130 2.960 2.937 5.022 3.318 5000000 3.568 5.233 3.914 3.504 3.629 4.912 4.458 6000000 4.237 6.233 4.710 4.323 4.514 5.066 3.876 7000000 4.962 7.403 5.447 5.037 5.129 7.053 7.774 8000000 5.854 8.696 6.151 5.963 5.908 8.704 5.836 9000000 6.749 9.540 7.078 6.869 6.985 6.354 3.834 10000000 7.667 10.944 7.621 7.322 7.439 10.092 7.112 11000000 8.400 11.966 8.566 8.284 8.112 10.511 8.184 Total elapsed time: 23.84 minutes My conclusions from single calls: Thomas' select() and Tim's select7() as pure Python functions are too slow for serious contention. [Aside: I wonder how PyPy would go with them?] There's not much difference in performance between the various median-of-median-of-k functions for larger k, but it seems to me that overall k=47 is marginally faster than either k=23 or k=97. Overall, sorting is as good or better (and usually *much better*) than any of the pure-Python functions for the values of N tested, at least on my computer. C versions may be worth testing, but I'm afraid that is beyond me. Thomas' select2 using dual pivots seems like the most promising. There are a couple of anomalous results where select2 unexpectedly (to me!) does much, much better than sorting, e.g. for N=9 million. Pure chance perhaps? The overall trend seems to me to suggest that a pure-Python version of select2 may become reliably faster than sorting from N=10 million or so, at least with random data on my computer. YMMV, and I would expect that will non-random partially sorted data, the results may be considerably different. == Average of three calls mode == N sort select7 select23 select47 select97 select select2 -------- -------- -------- -------- -------- -------- -------- -------- 5000 0.001 0.012 0.007 0.008 0.007 0.022 0.007 10000 0.002 0.022 0.015 0.015 0.015 0.041 0.016 50000 0.016 0.125 0.086 0.080 0.085 0.259 0.073 100000 0.037 0.258 0.181 0.155 0.156 0.650 0.137 500000 0.242 1.374 0.950 0.963 1.075 4.828 1.135 1000000 0.564 2.892 1.998 1.952 2.100 5.055 1.721 2000000 1.227 5.822 4.084 3.876 4.070 18.535 3.379 3000000 2.034 8.825 6.264 6.256 5.798 29.206 4.851 4000000 2.761 12.275 8.209 7.767 9.111 38.186 8.899 5000000 3.587 14.829 10.289 10.385 10.685 53.101 8.149 6000000 4.320 17.926 12.925 12.455 12.639 73.876 10.336 7000000 5.237 21.504 15.221 14.740 16.167 87.315 12.254 8000000 6.145 24.503 16.918 15.761 18.430 103.394 16.923 9000000 6.947 26.801 19.993 18.755 20.676 106.303 16.444 10000000 8.113 30.933 21.352 20.341 20.417 102.421 16.987 11000000 9.031 33.912 24.676 23.624 22.448 114.279 18.698 Total elapsed time: 81.39 minutes In this set of tests, each function is called three times on the same set of data. As expected, once the list is sorted on the first call, sorting it again on the second call is very fast, and so the "sort" column is quite similar to the previous set of tests. What I didn't expect is just how badly the various other selection functions cope with being called three times on the same list with different ranks. The extreme case is Thomas' select() function. Total time to call it three times on a list of 11 million items is 342 seconds (3*114), compared to 10 seconds to call it once. I expected that having partially ordered the data on the first call, the second and third calls would take less time rather than more. Was I ever wrong. Unless my analysis is wrong, something bad is happening here, and I don't know what it is. [Aside: this suggests that, unlike sort() which can take advantage of partially ordered data to be more efficient, the other selection functions are hurt by partially ordered data. Is this analogous to simple versions of Quicksort which degrade to O(N**2) if the data is already sorted?] What is abundantly clear is that if you want to make more than one selection from a list, you ought to sort it first. Given these results, I do not believe that a pure-python implementation of any of these selection algorithms can be justified on performance grounds for CPython. Thanks to Tim Peters and Thomas Ahle for their valuable assistance in writing the selection functions in the first place. ---------- Added file: http://bugs.python.org/file35423/select.py _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue21592> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com