from:"Paul Crowley"

[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley



New submission from Paul Crowley :

random.sample currently uses either a Fisher-Yates shuffle, or rejection 
sampling, to achieve sampling without replacement. I propose using reservoir 
sampling or "cardchoose"; these are similar performance or sometimes faster, 
and don't need to allocate anything except the list used for the results.

--
components: Library (Lib)
messages: 328728
nosy: ciphergoth
priority: normal
severity: normal
status: open
title: Improved algorithms for random.sample
type: resource usage
versions: Python 3.8

___
Python tracker 
<https://bugs.python.org/issue35094>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley



Change by Paul Crowley :


--
keywords: +patch
pull_requests: +9510
stage:  -> patch review

___
Python tracker 
<https://bugs.python.org/issue35094>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley



Paul Crowley  added the comment:

I would be very grateful for your help finding those dicussions! I've tried 
this search:

https://www.google.com/search?q=python+%22Reservoir+sampling%22+rhettinger

and found this discussion:

https://mail.python.org/pipermail/python-ideas/2016-April/039708.html

but if I've missed any I'm keen to know.

In my pull request reservoir sampling is only used if 2k>=n, so it makes at 
most twice as many random requests as any other algorithm.

--

___
Python tracker 
<https://bugs.python.org/issue35094>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue35094] Improved algorithms for random.sample

2018-10-28 Thread Paul Crowley



Paul Crowley  added the comment:

Thank you for a very comprehensive and helpful answer!

Yep, reservoir sampling makes n calls not k calls, and so should only be used 
when k is a large fraction of n; in my patch it's k/n >= 1/2.

Because modern CPRNGs are so fast, I had been assuming that overall runtime, 
rather than calls to the RNG; I'll have to bear that in mind here, though in 
general "use a secure seed to whatever secure RNG is fastest" is the right 
strategy.

I don't think hedging against the quality of the RNG is the right thing to do 
here.

I don't mean to suggest you didn't think about this problem hard! It's just 
that I've been obsessing about this problem for the last few weeks for some 
reason (see my repo) so I thought I might be able to help. Thanks again for you 
reply!

--

___
Python tracker 
<https://bugs.python.org/issue35094>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue35094] Improved algorithms for random.sample

[issue35094] Improved algorithms for random.sample

[issue35094] Improved algorithms for random.sample

[issue35094] Improved algorithms for random.sample

4 matches

Site Navigation

Mail list logo

Footer information