[I] Reservoir sampling improvements (datasketches-java)

via GitHub Sun, 02 Feb 2025 22:45:05 -0800


jmalkin opened a new issue, #647:
URL: https://github.com/apache/datasketches-java/issues/647


   We can make a number of improvements to reservoir sampling:
   
   * We can go from 2 random draws per accepted sample to 1 by picking a random 
long from 0 to n and accepting if it's less than k -- which also provides the 
location to use, This would get around the current max size bit limit (which is 
unlikely to be an issue, but still) by removing the use of a double for a value 
in [0, 1).
   * After discussing with a stats person about a year ago, we believe we could 
even switch to a random draw from the distribution for the next item, dropping 
the random draws to O(accepted items) rather than O(n). As long as merges 
happen independent of that random draw (so no adversary able to inspect it) 
then we should be able to merge and re-draw for the merged sample.
   * Merge can be improved to give uniform second-order probabilities. This is 
the exciting one for me. But there's a caveat.
   
   Proposed merge, with python pseudocode:
   ```python
   # Merge two random samples. Samples are lists s1 and s2. The numbers n1 and 
n2
   # are the sizes of the sets from which s1 and s2 were sampled.  Value s is 
the specified
   # size for the merged sample. Requires s <= len(s1) and s <= len(s2). 
   def merge(s1, s2, n1, n2, s): 
     # Find number to draw from s1. (The others will be from s2.)
     t = 0 # Number of observations to draw from s1.
   
     for i in range(s):
       j = random.randint(1, n1 + n2)
       
       if j <= n1:
         t += 1
         n1 -= 1
       else: n2 -= 1
   
     # Draw the samples.
     a = random.sample(s1, t) # Sample without replacement.
     b = random.sample(s2, s - t) # Sample without replacement.
   
     return a + b # Make one list with the elements of a and b.
   
   ```
   The random sample without replacement is important here, which is the source 
of the caveat: You can almost use the first t and s-t samples from s1 and s2, 
respectively, except that there's a position bias from the initial reservoir 
fill. For items 0 to k-1, they can only ever appear at that specific index. We 
don't start placing randomly until after the reservoir is filled.
   
   The options I can think of for this:
   1. Randomly permute items when the reservoir fills. If it hasn't filled at 
merge, you're just inserting as new items of weight 1 anyway. Doesn't help with 
already-serialized samples.
   2. Randomly permute the input sample at merge (a modified Fisher-Yates 
shuffle since you can stop after the first t or s-t items). But this is an 
undesirable side-effect for merging, even if it doesn't impact the 
probabilities.
   3. Copy the array and permute that. Uses more space but avoids side-effects.
   
   Maybe we can mix 1 and 3? For new samples, randomly permute upon fill and 
track that with a flag (which would . If a sample needs to be merged and 
doesn't have the flag, then copy and permute the array.
   
   This should probably have been 3 separate issues. But wanted to document the 
ideas before I forget yet again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Reservoir sampling improvements (datasketches-java)

Reply via email to