Oscar Benjamin <oscar.j.benja...@gmail.com> added the comment:

All good points :)

Here's an implementation with those changes: it shuffles but gives the 
option to preserve order. It also handles the case W == 1.0, which can happen 
at the first step with probability 1 - (1 - 2**-53)**k.
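
To see how W can round up to 1.0: 1 - 2**-53 is the largest value random() 
can return, and already for e.g. k = 8 its k'th root rounds to 1.0:

>>> (1.0 - 2.0**-53) ** (1/8) == 1.0
True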

Attempting to preserve order makes the expected storage requirement 
O(k*log(k)) rather than a deterministic O(k), but note that the log(k) part 
just refers to the values list growing longer with references to None: only k 
of the items from the iterable are stored at any time. The implementation 
could be simplified by removing the option to preserve order, which would 
also make it faster in the small-iterable case.

There are a few timings below for choosing from a dict vs converting to a 
list and using sample (I don't have a 3.9 build immediately available to try 
choices). Note that these benchmarks are not the primary motivation for 
sample_iter, though: the motivating case is where the underlying iterable is 
much more expensive in memory and/or time and where the length is not known 
ahead of time.



from math import log, log1p, floor
from random import random, randrange, shuffle as _shuffle
from itertools import islice


def sample_iter(iterable, k=1, shuffle=True):
    """Choose a sample of k items from iterable

    shuffle=True (default) gives the items in random order
    shuffle=False preserves the original ordering of the items
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    irange = range(len(values))
    indices = dict(zip(irange, irange))

    kinv = 1 / k
    W = 1.0
    while True:
        W *= random() ** kinv
        # random() < 1.0 but random() ** kinv might not be
        # W == 1.0 implies "infinite" skips
        if W == 1.0:
            break
        # skip is geometrically distributed with parameter W
        skip = floor(log(random()) / log1p(-W))
        try:
            newval = next(islice(iterator, skip, skip+1))
        except StopIteration:
            break
        # Append new, replace old with dummy, and keep track of order
        remove_index = randrange(k)
        values[indices[remove_index]] = None
        indices[remove_index] = len(values)
        values.append(newval)

    values = [values[indices[i]] for i in irange]

    if shuffle:
        _shuffle(values)

    return values
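
For example (outputs here are illustrative, since the selection is random):

>>> sample_iter(iter(range(10**6)), 3)
[487925, 12304, 931486]
>>> sample_iter(range(100), 3, shuffle=False)
[17, 45, 90]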


Timings for a large dict (1,000,000 items):

In [8]: n = 6

In [9]: d = dict(zip(range(10**n), range(10**n)))

In [10]: %timeit sample_iter(d, 10)
16.1 ms ± 363 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit sample(list(d), 10)
26.3 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Timings for a small dict (5 items):

In [14]: d2 = dict(zip(range(5), range(5)))

In [15]: %timeit sample_iter(d2, 2)
14.8 µs ± 539 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [16]: %timeit sample(list(d2), 2)
6.27 µs ± 457 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


The crossover point for this benchmark is around 10,000 items with k=2. 
Profiling at 10,000 items with k=2 shows that in either case the time is 
dominated by list/next, so the difference comes down to how efficiently we 
can iterate vs build the list. For small dicts it is probably possible to get 
a significant speed-up by removing the no-shuffle option and simplifying the 
routine; a sketch of that follows.
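
A rough sketch of that simplified version (untested here; same update rule 
and skip logic as sample_iter, just without order preservation):

from math import log, log1p, floor
from random import random, randrange, shuffle as _shuffle
from itertools import islice


def sample_iter_simple(iterable, k=1):
    """Choose a sample of k items from iterable (random order)"""
    iterator = iter(iterable)
    values = list(islice(iterator, k))
    W = 1.0
    while True:
        W *= random() ** (1/k)
        # W == 1.0 implies "infinite" skips
        if W == 1.0:
            break
        # skip is geometrically distributed with parameter W
        skip = floor(log(random()) / log1p(-W))
        try:
            newval = next(islice(iterator, skip, skip+1))
        except StopIteration:
            break
        # No ordering to track, so overwrite a random slot in place
        values[randrange(k)] = newval
    _shuffle(values)
    return values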


> Although why it keeps taking k'th roots remains a mystery to me ;-)

Thinking of sample_iter_old, before doing a swap the uvals in our reservoir 
look like:

  U0 = {u[1], u[2], ... u[k-1], W0}
  W0 = max(U0)

Here u[1] ... u[k-1] are uniform in (0, W0). We find a new u[k] < W0 which we 
swap in while removing W0, and afterwards we have

  U1 = {u[1], u[2], ... u[k-1], u[k]}
  W1 = max(U1)

Given that U1 is k iid uniform variates in (0, W0), we have (in distribution)

  W1 = W0 * max(random() for _ in range(k)) = W0 * W'

Here W' has cdf x**k, so by inverse transform sampling we can generate it as 
random()**(1/k). That gives the update rule for sample_iter:

  W *= random() ** (1/k)
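
That identity is easy to sanity check numerically (a quick Monte Carlo, not 
from the timings above): the max of k uniforms and random() ** (1/k) should 
agree in distribution, and in particular both have mean k/(k+1):

from random import random
from statistics import mean

k, n = 5, 10**5
max_draws = [max(random() for _ in range(k)) for _ in range(n)]
pow_draws = [random() ** (1/k) for _ in range(n)]
# both estimates should be close to the exact mean k/(k+1) = 0.8333...
print(mean(max_draws), mean(pow_draws), k/(k + 1))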

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41311>
_______________________________________