Chris Rebert wrote:
> On Mon, Oct 18, 2010 at 11:40 PM, Arnaud Delobelle <arno...@gmail.com>
> wrote:
>> elsa <kerensael...@hotmail.com> writes:
>>> Hello,
>>>
>>> I'm trying to find a way to collect a set of values from real data,
>>> and then sample values randomly from this data - so, the data I'm
>>> collecting becomes a kind of probability distribution. For instance,
>>> I might have age data for some children. It's very easy to collect
>>> this data using a list, where the index gives the value of the data,
>>> and the number in the list gives the number of times that value
>>> occurs:
>>>
>>> [0,0,10,20,5]
>>>
>>> could mean that there are no people aged 0, no people aged 1, 10
>>> people aged 2, 20 people aged 3, and 5 people aged 4 in my data
>>> collection.
>>>
>>> I then want to make a random sample that would be representative of
>>> these proportions - is there any easy and fast way to select an
>>> entry weighted by its value? Or are there any python packages that
>>> allow you to easily create your own distribution based on collected
>>> data?
> <snip>
>> If you want to keep it simple, you can do:
>>
>> >>> t = [0,0,10,20,5]
>> >>> expanded = sum([[x]*f for x, f in enumerate(t)], [])
>> >>> random.sample(expanded, 10)
>> [3, 2, 2, 3, 2, 3, 2, 2, 3, 3]
>> >>> random.sample(expanded, 10)
>> [3, 3, 4, 3, 2, 3, 3, 3, 2, 2]
>> >>> random.sample(expanded, 10)
>> [3, 3, 3, 3, 3, 2, 3, 2, 2, 3]
>>
>> Is that what you need?
>
> The OP explicitly ruled that out:
>
>>> Two other things to bear in mind are that in reality I'm collating
>>> data from up to around 5 million individuals, so just making one
>>> long list with a new entry for each individual won't work.
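To put a number on that, here is a minimal sketch of the expanded-list
idea at roughly that scale (the age range and the counts below are made
up for illustration, and random.choice draws *with* replacement, which
may or may not be what you want):

import random

# Hypothetical data: counts[age] = number of individuals of that age,
# around 5 million entries in total.
counts = [random.randint(0, 80000) for _ in range(121)]

expanded = []
for age, n in enumerate(counts):
    expanded.extend([age] * n)      # flat list, one entry per individual

print(len(expanded))                # on the order of 5 * 10**6
print([random.choice(expanded) for _ in range(10)])  # ten weighted draws
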
Python can cope with a list of 5 million integer entries just fine on
average hardware. Eventually you may have to switch to Ian's cumulative
sums approach (see the sketch at the end of this message) -- but not
necessarily at 10**6 entries.

>>> Also, it would be
>>> good if I didn't have to decide beforehand what the possible range
>>> of values is (which unfortunately I have to do with the approach I'm
>>> currently working on).

This second objection seems invalid to me, too, and I think what Arnaud
provides is a useful counterexample. However, if you (elsa) are
operating near the limits of the available memory on your machine,
using sum() on lists is not a good idea. It does the equivalent of

expanded = []
for x, f in enumerate(t):
    expanded = expanded + [x]*f

which creates a lot of "large" temporary lists, where you want the more
memory-friendly

expanded = []
for x, f in enumerate(t):
    expanded.extend([x]*f)  # or: expanded += [x]*f

> The internet is wrecking people's attention spans and reading
> comprehension.

Maybe, but I can't google the control group that is always offline, and
I have a hunch that Facebook wouldn't work either ;)

Peter
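P.S. Since cumulative sums came up: here is my reading of that approach
as a minimal sketch (the draw() helper and the counts are illustrative,
not Ian's actual code):

import bisect
import random

counts = [0, 0, 10, 20, 5]  # counts[age] = number of people of that age

# Running totals: cumulative[i] = number of people aged i or younger.
cumulative = []
total = 0
for c in counts:
    total += c
    cumulative.append(total)

def draw():
    # Pick one of the `total` individuals uniformly at random and map
    # the pick back to its age bucket with a binary search.
    r = random.randrange(total)
    return bisect.bisect_right(cumulative, r)

print([draw() for _ in range(10)])  # ten weighted draws, with replacement

This needs only one list entry per distinct value rather than one per
individual, so it stays small no matter how many of the 5 million
individuals share an age.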