Roy Smith wrote:
> In article <xns9e59a44d7cc49duncanbo...@127.0.0.1>,
>  Duncan Booth <duncan.bo...@invalid.invalid> wrote:
>
>> Roy Smith <r...@panix.com> wrote:
>>
>> > I'm processing a stream of N numbers and want to keep track of the K
>> > largest.  There's too many numbers in the stream (i.e. N is too large)
>> > to keep in memory at once.  K is small (100 would be typical).
>> > ...
>> > Is there a better way to do this, either from a theoretical running
>> > time point of view, or just a nicer way to code this in Python?
>>
>> How about:
>>
>>     from heapq import nlargest
>>     top = nlargest(K, input())
>>
>> It uses a heap so avoids completely resorting each time.
>
> Hmmm, that looks like it would solve the problem as stated, but there's
> another twist.  In addition to finding the K largest values, I *also*
> need to do some other processing on all the values (which I didn't show
> in the original example, incorrectly thinking it not germane to my
> question).  The problem with nlargest() is that it doesn't give me a
> hook to do that.
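One way to get that hook without giving up nlargest() is to wrap the stream in a generator that performs the extra per-item work before yielding each value. This is a sketch, not from the thread; process() is a hypothetical stand-in for the OP's unshown processing:

```python
import heapq

seen = []  # record of the side processing, purely for illustration

def process(item):
    # hypothetical stand-in for the OP's extra per-item work
    seen.append(item)

def observed(stream):
    # generator wrapper: do the side work, then pass each item through
    for item in stream:
        process(item)
        yield item

top3 = heapq.nlargest(3, observed(iter([5, 1, 9, 3, 7])))
# top3 == [9, 7, 5]; seen now holds all five values
```

Since nlargest() consumes the wrapped iterator lazily, the side processing still happens in a single pass over the stream.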
If Paul's solution doesn't suffice -- the heapq module has the building
blocks for a custom solution, e.g.:

import heapq
from functools import partial

class NLargest(object):
    def __init__(self, n):
        self.n = n
        self.heap = []

    def tally(self, item):
        heap = self.heap
        if len(heap) >= self.n:
            heapq.heappushpop(heap, item)
            self.tally = partial(heapq.heappushpop, heap)
        else:
            heapq.heappush(heap, item)

    def __str__(self):
        return str(sorted(self.heap, reverse=True))

if __name__ == "__main__":
    import random
    items = range(100)
    random.shuffle(items)
    accu = NLargest(10)
    for item in items:
        accu.tally(item)
        print item, accu

> PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
> proportional to n, and not to the iterator length.  That's the only
> reasonable conclusion, but the docs don't actually come out and say it.

I would hope so.

--
http://mail.python.org/mailman/listinfo/python-list
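The code above is Python 2 (print statement, list-returning range). For readers on Python 3, the same idea can be sketched like this; the result() method is an addition for convenience, not part of the original post:

```python
import heapq
from functools import partial

class NLargest:
    """Track the n largest items seen so far (Python 3 sketch)."""

    def __init__(self, n):
        self.n = n
        self.heap = []

    def tally(self, item):
        heap = self.heap
        if len(heap) >= self.n:
            # heap is full: push the new item, drop the current smallest
            heapq.heappushpop(heap, item)
            # rebind tally so later calls skip the length check entirely
            self.tally = partial(heapq.heappushpop, heap)
        else:
            heapq.heappush(heap, item)

    def result(self):
        # largest first, like heapq.nlargest
        return sorted(self.heap, reverse=True)

accu = NLargest(3)
for item in [4, 1, 7, 3, 9, 2]:
    accu.tally(item)
# accu.result() == [9, 7, 4]
```

The self.tally rebinding is the same trick as in the original: once the heap reaches size n, every subsequent call goes straight to heappushpop, which returns the incoming item unchanged whenever it is not larger than the current minimum.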