New submission from Stefan Pochmann <stefan.pochm...@gmail.com>:
The current implementation is: def multimode(data): counts = Counter(iter(data)).most_common() maxcount, mode_items = next(groupby(counts, key=itemgetter(1)), (0, [])) return list(map(itemgetter(0), mode_items)) The most_common() does a complete sort of Counter item tuples, taking O(n log n) time and quite big O(n) extra space (mostly for all those tuples). When Raymond Hettinger suggested it in https://bugs.python.org/issue35892#msg336338 he said it should have "running speed that is optimal for the desired result". But then he detailed that with "Slow O(n log n), loads all data in memory, full sort". Which seems like a mistake, as that's not optimal. It's easy to do in O(n) time and O(1) extra memory (in addition to the Counter and result, I mean): def multimode(data): counts = Counter(iter(data)) if not counts: return [] maxcount = max(counts.values()) return [value for value, count in counts.items() if count == maxcount] If there are only very few *different* values then the time/space after creating the Counter is insignificant compared to the Counter creation. But if there are many different values, it can be significant. statistics.mode takes O(n) time and O(1) space, which is optimal, but I found an apparently faster way anyway (code at end). For example for data = random.choices(range(n), k=n): | multimode | mode n | current proposal | current proposal -----------+-------------------+------------------ 1 | 131% 70% | 125% 58% 10 | 144% 73% | 119% 53% 100 | 126% 71% | 108% 29% 1,000 | 123% 65% | 62% 22% 10,000 | 172% 55% | 53% 18% 100,000 | 164% 44% | 55% 20% 1,000,000 | 85% 20% | 22% 4% 10,000,000 | 56% 12% | 11% 4% All four start with Counter(iter(data)), so I took that as baseline and the above results show relative additional times. For example 55% means if Counter construction alone took 10 seconds, the function took 15.5 seconds. An extreme case, data = list(range(n)): | multimode | mode n | current proposal | current proposal -----------+------------------+----------------- 1 | 128% 67% | 124% 56% 10 | 187% 93% | 141% 52% 100 | 316% 149% | 181% 45% 1,000 | 380% 174% | 213% 46% 10,000 | 349% 111% | 146% 30% 100,000 | 397% 128% | 159% 34% 1,000,000 | 336% 95% | 112% 24% 10,000,000 | 349% 97% | 109% 23% I also tried a bunch of other cases, didn't find one where my versions weren't quite a bit faster. My mode() version: from operator import indexOf from itertools import islice def mode(data): counts = Counter(iter(data)) if not counts: raise StatisticsError('no mode for empty data') from None maxcount = max(counts.values()) index = indexOf(counts.values(), maxcount) return next(islice(counts, index, None)) ---------- components: Library (Lib) messages: 406638 nosy: Stefan Pochmann priority: normal severity: normal status: open title: statistics.multimode is inefficient (time and space) (mode somewhat, too) type: performance versions: Python 3.10 _______________________________________ Python tracker <rep...@bugs.python.org> <https://bugs.python.org/issue45851> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com