Bruza wrote:
> On Nov 16, 4:47 pm, Bruza <[EMAIL PROTECTED]> wrote:
>> On Nov 16, 6:58 am, duncan smith <[EMAIL PROTECTED]> wrote:
>>
>>> Bruza wrote:
>>>> I need to implement a "random selection" algorithm which takes a list
>>>> of [(obj, prob),...] as input. Each of the (obj, prob) represents how
>>>> likely an object, "obj", should be selected based on its probability
>>>> of "prob". To simplify the problem, assume the "prob" values are
>>>> integers and the sum of all "prob" equals 100. For example,
>>>>
>>>> items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
>>>>
>>>> The algorithm will take a number "N", and a [(obj, prob),...] list as
>>>> inputs, and randomly pick "N" objects based on the probabilities of
>>>> the objects in the list.
>>>>
>>>> For N=1 this is pretty simple; the following code is sufficient to do
>>>> the job.
>>>>
>>>> def foo(items):
>>>>     index = random.randint(0, 99)
>>>>     currentP = 0
>>>>     for (obj, p) in items:
>>>>         currentP += p
>>>>         if currentP > index:
>>>>             return obj
>>>>
>>>> But how about the general case, for N > 1 and N < len(items)? Is there
>>>> some clever algorithm using the Python standard "random" package to
>>>> do the trick?
>>>
>>> I think you need to clarify what you want to do. The "probs" are
>>> clearly not probabilities. Are they counts of items? Are you then
>>> sampling without replacement? When you say N < len(items) do you mean
>>> N <= sum of the "probs"?
>>>
>>> Duncan
>>
>> I think I need to explain the probability part: the "prob" is a
>> relative likelihood that the object will be included in the output
>> list. So, in my example input of
>>
>> items = [('Mary',30), ('John', 10), ('Tom', 45), ('Jane', 15)]
>>
>> for any size of N, 'Tom' (with prob of 45) will be more likely to be
>> included in the output list of N distinct members than 'Mary' (prob
>> of 30), and much more likely than 'John' (with prob of 10).
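For reference, the quoted foo above accumulates a name the loop never binds (`w` instead of `p`); a self-contained corrected sketch (`pick_one` is a made-up name, and `random.randrange` is used so the total need not be hard-coded to 100):

```python
import random

items = [('Mary', 30), ('John', 10), ('Tom', 45), ('Jane', 15)]

def pick_one(items):
    # walk the cumulative weights until they exceed a uniform draw
    total = sum(p for _, p in items)
    index = random.randrange(total)
    current = 0
    for obj, p in items:
        current += p
        if current > index:
            return obj
```

Over many calls, 'Tom' should come back roughly 45% of the time.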
>>
>> I know "prob" is not exactly the "probability" in the context of
>> returning a multiple-member list. But what I want is a way to "favor"
>> some members in the selection process.
>>
>> So far, only Boris's solution is closest (but not quite) to what I
>> need, which returns a list of N distinct objects from the input
>> "items". However, I tried with input of
>>
>> items = [('Mary',1), ('John', 1), ('Tom', 1), ('Jane', 97)]
>>
>> and have a repeated calling of
>>
>> Ben
>
> OOPS. I pressed Send too fast.
>
> The problem w/ Boris's solution is that after repeated calling of
> randomPick(3, items), 'Jane' is not the most frequently appearing
> member across all the output lists of 3 members...
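One way to get N distinct members while still favoring the high weights is weighted sampling without replacement: draw one object at a time in proportion to its weight, then drop it from the pool. A sketch (this is not Boris's posted randomPick; `random_pick` is a made-up name):

```python
import random

def random_pick(n, items):
    # weighted sampling without replacement: each pass draws one
    # object in proportion to its weight and removes it from the pool
    pool = list(items)
    chosen = []
    for _ in range(n):
        total = sum(p for _, p in pool)
        r = random.randrange(total)
        acc = 0
        for i, (obj, p) in enumerate(pool):
            acc += p
            if acc > r:
                chosen.append(obj)
                del pool[i]
                break
    return chosen
```

With items = [('Mary',1), ('John',1), ('Tom',1), ('Jane',97)], 'Jane' should appear in nearly every result of random_pick(3, items), which is the behavior Ben is asking for.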
So it looks like you are sampling without replacement. For large N you
don't want to be generating N individual observations. Do you want
something like the following?

>>> import randvars
>>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 100)
[34, 8, 40, 18]
>>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 10000)
[2984, 1003, 4511, 1502]
>>> randvars.multinomial_variate([0.3, 0.1, 0.45, 0.15], 1000000)
[300068, 99682, 450573, 149677]
>>>

The algorithm I use is to generate independent Poisson variates for each
category, with Poisson parameters proportional to the probabilities.
This won't usually give a total of exactly N, but the sample can be
adjusted by adding or subtracting individual observations. In fact, I
choose a constant of proportionality that aims to get the total close
to, but not greater than, N. I repeatedly generate Poisson counts until
the sample size is not greater than N, then adjust by adding individual
observations. (This algorithm might be in Knuth, and might be due to
Ahrens and Dieter; I can't remember the exact source.)

An efficient way to generate the individual observations is to construct
a cumulative distribution function and use binary search. I know the
code for this has been posted before. An alias table should generally be
faster, but my Python-coded effort is only similarly fast.

Unfortunately, generating Poisson variates isn't straightforward, and
the approach I use requires the generation of gamma and beta variates
(both available in the random module, though). This approach is due to
Ahrens and Dieter and is in Knuth, Vol. 2.

So yes, there is a clever algorithm that uses functions in the random
module (Knuth, Ahrens and Dieter are quite clever). But it's not simple.
The numpy random module has a function for Poisson variates, which would
make things easier for you (I don't offhand know which algorithm is
used). Then it's just a case of choosing an appropriate constant of
proportionality for the Poisson parameters.
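A rough sketch of the scheme described above, assuming probabilities that sum to 1. This is not Duncan's randvars module: `poisson_variate` here uses Knuth's simple multiplication method, which is practical only for modest Poisson parameters (the gamma/beta approach or numpy.random.poisson scales to large ones), and the constant of proportionality is the simple N - k choice for k <= N**0.5:

```python
import bisect
import math
import random

def poisson_variate(lam):
    # Knuth's multiplication method: multiply uniforms until the
    # product drops below exp(-lam). Only suitable for modest lam,
    # since exp(-lam) underflows for large lam.
    threshold = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= random.random()
        if prod <= threshold:
            return k
        k += 1

def multinomial_variate(probs, n):
    # Poisson parameters proportional to probs, scaled so the total
    # lands close to, but usually not above, n; regenerate until the
    # total does not exceed n.
    c = n - len(probs)  # the N - k rule, for k <= sqrt(N)
    while True:
        counts = [poisson_variate(c * p) for p in probs]
        if sum(counts) <= n:
            break
    # top up with individual observations via a cumulative
    # distribution function and binary search
    cdf, acc = [], 0.0
    for p in probs:
        acc += p
        cdf.append(acc)
    for _ in range(n - sum(counts)):
        counts[bisect.bisect(cdf, random.random() * acc)] += 1
    return counts
```

For example, multinomial_variate([0.3, 0.1, 0.45, 0.15], 100) returns four non-negative counts summing to exactly 100.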
Try N - k, where k is the number of categories, for k <= N**0.5;
otherwise, N - N**0.5 - (k - N**0.5)**0.5.

Duncan
--
http://mail.python.org/mailman/listinfo/python-list