Tim Peters <t...@python.org> added the comment:

Pro: focus on the "iterable" part of the title. If you want to, e.g., select 3 
lines "at random" out of a multi-million-line text file, this kind of reservoir 
sampling allows to do that holding no more than one line in memory 
simultaneously. Materializing an iterable in a sequence is sometimes 
impossible. Yup, it takes O(number of elements) time, but so does anything else 
that has to look at every element of an iterable (including materializing an 
iterable in a sequence!).

So this would make most sense as a different function.

Con: over the years we've worked diligently to purge these algorithms of minor 
biases, leaving them as unbiased as the core generator. Introducing 
floating-point vagaries goes against that, and results may vary at times due to 
tiny differences in platform exp() and log() implementations.

Worse, I'm not sure that the _approach_ is beyond reproach: the claim that 
skips "follow a geometric distribution" is baffling to me. If the reservoir has 
size K, then when we're looking at the N'th element it should be skipped with 
probability 1-K/N. But when looking at the N+1'th, that changes to 1-K/(N+1). 
Then 1-K/(N+2), and so on. These probabilities are all different. In an actual 
geometric distribution, the skip probability is a constant.

Now 1-K/(N+i) is approximately equal to 1-K/(N+j) for i and j sufficiently 
close, and for K sufficiently smaller than N, so the skips may be well 
_approximated_ by a geometric distribution. But that's quite different than 
saying they _do_ follow a geometric distribution, and I see no reason to trust 
that the difference can't matter.

The simple algorithm on the Wikipedia page suffers none of these potential 
problems (it implements an obviously-sound approach, & our `randrange()` is 
unbiased and platform-independent).  But, for selecting 3 out of a 
million-element iterable, would invoke the core generator hundreds of thousands 
of times more often.

----------
nosy: +mark.dickinson, tim.peters

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41311>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to