jmalkin opened a new issue, #647:
URL: https://github.com/apache/datasketches-java/issues/647
We can make a number of improvements to reservoir sampling:
* We can go from 2 random draws per accepted sample to 1 by picking a random
long from 0 to n and accepting if it's less than k -- which also provides the
location to use, This would get around the current max size bit limit (which is
unlikely to be an issue, but still) by removing the use of a double for a value
in [0, 1).
* After discussing with a stats person about a year ago, we believe we could
even switch to a random draw from the distribution for the next item, dropping
the random draws to O(accepted items) rather than O(n). As long as merges
happen independent of that random draw (so no adversary able to inspect it)
then we should be able to merge and re-draw for the merged sample.
* Merge can be improved to give uniform second-order probabilities. This is
the exciting one for me. But there's a caveat.
Proposed merge, with python pseudocode:
```python
# Merge two random samples. Samples are lists s1 and s2. The numbers n1 and
n2
# are the sizes of the sets from which s1 and s2 were sampled. Value s is
the specified
# size for the merged sample. Requires s <= len(s1) and s <= len(s2).
def merge(s1, s2, n1, n2, s):
# Find number to draw from s1. (The others will be from s2.)
t = 0 # Number of observations to draw from s1.
for i in range(s):
j = random.randint(1, n1 + n2)
if j <= n1:
t += 1
n1 -= 1
else: n2 -= 1
# Draw the samples.
a = random.sample(s1, t) # Sample without replacement.
b = random.sample(s2, s - t) # Sample without replacement.
return a + b # Make one list with the elements of a and b.
```
The random sample without replacement is important here, which is the source
of the caveat: You can almost use the first t and s-t samples from s1 and s2,
respectively, except that there's a position bias from the initial reservoir
fill. For items 0 to k-1, they can only ever appear at that specific index. We
don't start placing randomly until after the reservoir is filled.
The options I can think of for this:
1. Randomly permute items when the reservoir fills. If it hasn't filled at
merge, you're just inserting as new items of weight 1 anyway. Doesn't help with
already-serialized samples.
2. Randomly permute the input sample at merge (a modified Fisher-Yates
shuffle since you can stop after the first t or s-t items). But this is an
undesirable side-effect for merging, even if it doesn't impact the
probabilities.
3. Copy the array and permute that. Uses more space but avoids side-effects.
Maybe we can mix 1 and 3? For new samples, randomly permute upon fill and
track that with a flag (which would . If a sample needs to be merged and
doesn't have the flag, then copy and permute the array.
This should probably have been 3 separate issues. But wanted to document the
ideas before I forget yet again.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]