Roy Smith wrote:
> I have a list of items. I need to generate n samples of k unique items
> each. I not only want each sample set to have no repeats, but I also
> want to make sure the sets are disjoint (i.e. no item repeated between
> sets).
>
> random.sample(items, k) will satisfy the first constraint, but not the
> second. Should I just do random.sample(items, k*n), and then split the
> resulting big list into n pieces? Or is there some more efficient way?
>
> Typical values:
>
> len(items) = 5,000,000
> n = 10
> k = 100,000
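For concreteness, the approach you describe can be spelled out like
this (a minimal sketch; disjoint_samples is just an illustrative name,
and the chunking is plain list slicing):

    import random

    def disjoint_samples(items, n, k):
        # One big draw of n*k unique items, then split into n chunks.
        # Each chunk is repeat-free, and no item occurs in two chunks.
        big = random.sample(items, n * k)
        return [big[i * k:(i + 1) * k] for i in xrange(n)]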
I would expect that your simple approach is more efficient than
shuffling the whole list. Assuming there is a sample_iter(population)
that generates unique items from the population (which itself has no
repetitions), you can create the samples with

    import itertools
    from random import random

    g = sample_iter(items)
    samples = [list(itertools.islice(g, k)) for _ in xrange(n)]

My ideas for such a sample_iter():

    def sample_iter_mark(items):
        # Overwrite picked items with None; loops until islice stops it.
        n = len(items)
        while True:
            i = int(random() * n)
            v = items[i]
            if v is not None:
                yield v
                items[i] = None

This is destructive and will degrade badly as the number of None
entries grows. For your typical values it seems to be OK, though. You
can make it non-destructive by adding a bit array or a set
(random.Random.sample() has code that uses a set) to keep track of the
seen items.

Another sample_iter() (a variant of which is also part of the
random.Random.sample() implementation):

    def sample_iter_replace(items):
        # Move the last not-yet-seen item into the picked slot.
        n = len(items)
        for k in xrange(n):
            i = int(random() * (n - k))
            yield items[i]
            items[i] = items[n - k - 1]

You can micro-optimise that a bit to avoid the index calculation.
Also, instead of overwriting items you could swap them, so that no
values would be lost, only their initial order.
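A sketch of the non-destructive, set-based variant mentioned above
(sample_iter_set is my name for it, not something from the stdlib;
like the mark version, it slows down as the fraction of seen indices
grows):

    from random import random

    def sample_iter_set(items):
        # Track seen *indices* in a set instead of mutating the list.
        n = len(items)
        seen = set()
        while len(seen) < n:
            i = int(random() * n)
            if i not in seen:
                seen.add(i)
                yield items[i]

It drops into the islice snippet above unchanged.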
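And a sketch of the swapping variant (again, my naming): the chosen
item is swapped to the end of the not-yet-seen region instead of being
overwritten, so the population keeps all its values and only their
order changes:

    from random import random

    def sample_iter_swap(items):
        n = len(items)
        for k in xrange(n):
            i = int(random() * (n - k))
            # Swap the pick into the tail instead of overwriting it.
            items[i], items[n - k - 1] = items[n - k - 1], items[i]
            yield items[n - k - 1]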