Tim Chase wrote: > On 2014-02-16 04:12, Terry Reedy wrote: >> On 2/15/2014 11:41 PM, Tim Chase wrote: >> > data = ( >> > ("apple", 20), >> > ("orange", 50), >> > ("grape", 30), >> > ) > > To Ben, yes, this was just some sample data; the original gets built > from an external (i.e., client-supplied, thus the need to gracefully > support crazy-large numbers) data source and is indeed actually a list > rather than a tuple. When entering static data like this, I often > default to an outer tuple (rather than list) just as a hint/reminder > to myself that I don't expect this to change at runtime (and have > Python yell at me if I accidentally try). > >> If you actually start with date in this form, write the few lines >> needed to produce the form below. >> >> import bisect >> import random >> >> data = [ >> (0, 'apple'), >> (20, 'orange'), >> (70, 'grape'), >> ] >> >> for i in range(10): >> r = random.randrange(0, 100) >> i = bisect.bisect(data, (r, 'zzzzz')) - 1 >> print(data[i][1]) > > Trying to read what may be implicit assumptions in your code: > > 1) your code calculates "100" as sum(item[0] for item in data) > > 2) the data has to be sorted for bisect to work > > 3) you meant to write "(10, 'apple')" rather than 0. With my original > example code, a 0-probability shouldn't ever show up in the sampling, > where it looks like it might when using this sample code. In my > particular use case, I can limit/ensure that 0-probability items never > appear in the list, filtering them upon loading. > > 4) that "zzzzzz" is some arbitrary value that should come after any > string that could appear in the data; perhaps using some custom > "InfinityString" class where everything compared to it is always less > than it. > > So it would be > > class InfinityString: > def __gt__(self, other): True > __ge__ = __gt__ > def __lt__(self, other): False > __eq__ = __le__ = __ne__ = __lt__ > infinity_string = InfinityString() > data = load_data() # list of (quantity, value) tuples > data.sort() > total = sum(qty for qty, value in data) > for i in range(num_to_sample): > r = random.randrange(0, total) > i = bisect.bisect(data, (r, infinity_string)) - 1 > use(data[i][1]) > > Some long-running testing on this code seems to show that if two > items have the same probability, bisect only appears to find the last > one. Tested with > > data = [ > (10, "apple"), > (20, "banana"), # I never get any bananas, even after thousands of > iterations (20, "grape"), > (50, "orange"), > ]
I think it becomes simpler if you make an intermediate list with the cumulated probabilities. You can then avoid the InfinityString gymnastics: import random, bisect def cumulated(probs): sigma = 0 cumprobs = [] for p in probs: sigma += p cumprobs.append(sigma) return cumprobs def pick(cumprobs): return bisect.bisect(cumprobs, random.randrange(cumprobs[-1])) data = [ (10, "apple"), (20, "banana"), (20, "grape"), (50, "orange"), ] cumprobs = cumulated(p for p, k in data) # check for off-by-one bugs bins = [0] * len(cumprobs) for i in range(cumprobs[-1]): bins[bisect.bisect(cumprobs, i)] += 1 assert bins == [p for p, k in data] # use it bins = [0] * len(cumprobs) for i in range(10000): bins[pick(cumprobs)] += 1 for item, bin in zip(data, bins): print("{} --> {}".format(item, bin)) -- https://mail.python.org/mailman/listinfo/python-list