On 14/01/2019 20:11, duncan smith wrote: > Hello, > Just checking to see if anyone has attacked this problem before > for cases where the population size is unfeasibly large. i.e. The number > of categories is manageable, but the sum of the frequencies, N, > precludes simple solutions such as creating a list, shuffling it and > using the first n items to populate the sample (frequency distribution / > histogram). > > I note that numpy.random.hypergeometric will allow me to generate a > sample when I only have two categories, and that I could probably > implement some kind of iterative / partitioning approach calling this > repeatedly. But before I do I thought I'd ask if anyone has tackled this > before. Can't find much on the web. Cheers. > > Duncan >
After much tinkering I came up with the following: import numpy as np def hypgeom_variate(freqs, n): # recursive partitioning approach sample = [0] * len(freqs) cumsum = np.cumsum(list(chain([0], freqs))) if n > cumsum[-1]: raise ValueError('n cannot be greater than population size') hypergeometric = np.random.hypergeometric argslist = [(0, len(freqs), 0, cumsum[-1], n)] for i, k, ci, ck, m in argslist: if k == i + 1: sample[i] = m else: j = (i + k) // 2 cj = cumsum[j] x = hypergeometric(cj - ci, ck - cj, m, 1)[0] y = m-x if x: argslist.append((i, j, ci, cj, x)) if y: argslist.append((j, k, cj, ck, y)) return sample Cheers. Duncan -- https://mail.python.org/mailman/listinfo/python-list