On Sat, Dec 25, 2010 at 7:42 AM, Roy Smith <r...@panix.com> wrote: > I'm processing a stream of N numbers and want to keep track of the K > largest. There's too many numbers in the stream (i.e. N is too large) > to keep in memory at once. K is small (100 would be typical). > > >From a theoretical point of view, I should be able to do this in N log K > time. What I'm doing now is essentially: > > top = [-1] # Assume all x are >= 0 > for x in input(): > if x <= top[0]: > continue > top.append(x) > if len(top) > K: > top.sort() > top.pop(0) > > I can see pathological cases (say, all input values the same) where > running time would be N K log K, but on average (N >> K and random > distribution of values), this should be pretty close to N. > > Is there a better way to do this, either from a theoretical running time > point of view, or just a nicer way to code this in Python? > -- > http://mail.python.org/mailman/listinfo/python-list >
heapq is probably fine, but I've empirically found a treap faster than a heap under some conditions: http://stromberg.dnsalias.org/~strombrg/treap/ http://stromberg.dnsalias.org/~strombrg/highest/ -- http://mail.python.org/mailman/listinfo/python-list