On 14/01/2019 22:59, Gregory Ewing wrote:
> duncan smith wrote:
>> Hello,
>> Just checking to see if anyone has attacked this problem before
>> for cases where the population size is unfeasibly large.
>
> The fastest way I know of is to create a list of cumulative
> frequencies, then generate uniformly distributed numbers and
> use a binary search to find where they fall in the list.
> That's O(log n) per sample in the size of the list once it's
> been set up.
>
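Something like this, I take it (a rough, untested sketch; the names and toy
frequencies are mine). As written it draws categories with replacement, which
is where my problem starts:

import bisect
import itertools
import random

# Build the cumulative totals once; each draw is then a uniform variate
# plus a binary search over them.
categories = ['a', 'b', 'c']
freqs = [5, 90, 5]
cum_freqs = list(itertools.accumulate(freqs))   # [5, 95, 100]

def draw(categories, cum_freqs):
    """Return one category, chosen with probability proportional to its frequency."""
    u = random.random() * cum_freqs[-1]
    return categories[bisect.bisect_right(cum_freqs, u)]

print([draw(categories, cum_freqs) for _ in range(10)])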
That's the sort of thing I've been thinking about. But once I'd found the
relevant category I'd need to reduce its frequency by 1 and correspondingly
update the cumulative frequencies. Alternatively, I could add an extra
rejection step where I accept a unit from the chosen category with
probability equal to the proportion of that category's units not yet
sampled. I could maybe set up an alias table and do something similar.

The other thing I was thinking about was iterating through the categories
(ideally from largest frequency to smallest), generating the number to be
sampled from the current category versus the remaining categories (using
numpy.random.hypergeometric). With a few large frequencies and lots of small
frequencies that could be quite quick (on average). Alternatively, I could
partition the categories into two sets, generate the number to be sampled
from each partition, then partition the partitions, etc., binary search
style.

I suppose I'll try both the alias table + rejection step and the recursive
partitioning approach and see how they turn out. Cheers.

Duncan
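P.S. For reference, a rough, untested sketch of the category-by-category
hypergeometric idea (the function and variable names are mine). It only ever
draws per-category counts, so the population itself is never enumerated:

import numpy as np

def sample_counts(freqs, n):
    """How many of n units sampled without replacement fall in each
    category, given the category frequencies."""
    freqs = np.asarray(freqs)
    counts = np.zeros(len(freqs), dtype=int)
    remaining = int(freqs.sum())    # population units not yet considered
    assert n <= remaining
    # Largest categories first, so the loop tends to finish early.
    for i in np.argsort(freqs)[::-1]:
        if n == 0:
            break
        # "good" = units in this category, "bad" = units in the rest.
        k = np.random.hypergeometric(freqs[i], remaining - freqs[i], n)
        counts[i] = k
        n -= k
        remaining -= int(freqs[i])
    return counts

print(sample_counts([5, 90, 5], 10))    # three counts summing to 10

The recursive partitioning version would do the same kind of thing, but
splitting the set of categories in half at each step rather than peeling
them off one at a time.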