[issue35094] Improved algorithms for random.sample
New submission from Paul Crowley : random.sample currently uses either a Fisher-Yates shuffle, or rejection sampling, to achieve sampling without replacement. I propose using reservoir sampling or "cardchoose"; these are similar performance or sometimes faster, and don't need to allocate anything except the list used for the results. -- components: Library (Lib) messages: 328728 nosy: ciphergoth priority: normal severity: normal status: open title: Improved algorithms for random.sample type: resource usage versions: Python 3.8 ___ Python tracker <https://bugs.python.org/issue35094> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue35094] Improved algorithms for random.sample
Change by Paul Crowley : -- keywords: +patch pull_requests: +9510 stage: -> patch review ___ Python tracker <https://bugs.python.org/issue35094> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue35094] Improved algorithms for random.sample
Paul Crowley added the comment: I would be very grateful for your help finding those dicussions! I've tried this search: https://www.google.com/search?q=python+%22Reservoir+sampling%22+rhettinger and found this discussion: https://mail.python.org/pipermail/python-ideas/2016-April/039708.html but if I've missed any I'm keen to know. In my pull request reservoir sampling is only used if 2k>=n, so it makes at most twice as many random requests as any other algorithm. -- ___ Python tracker <https://bugs.python.org/issue35094> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue35094] Improved algorithms for random.sample
Paul Crowley added the comment: Thank you for a very comprehensive and helpful answer! Yep, reservoir sampling makes n calls not k calls, and so should only be used when k is a large fraction of n; in my patch it's k/n >= 1/2. Because modern CPRNGs are so fast, I had been assuming that overall runtime, rather than calls to the RNG; I'll have to bear that in mind here, though in general "use a secure seed to whatever secure RNG is fastest" is the right strategy. I don't think hedging against the quality of the RNG is the right thing to do here. I don't mean to suggest you didn't think about this problem hard! It's just that I've been obsessing about this problem for the last few weeks for some reason (see my repo) so I thought I might be able to help. Thanks again for you reply! -- ___ Python tracker <https://bugs.python.org/issue35094> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com