Try this:

x <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066, 0.038,
       0.2040, 0.0221, 0.234, 0.0443, 0.0684, 0.035)
cl <- kmeans(x, 5)
cl
newold <- with(cl, data.frame(old = x, new = centers[cluster]))
newold

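One caveat: kmeans() starts from randomly chosen centres, so repeated runs can return different buckets. The exhaustive search described in the quoted message below is feasible at this size (choosing 4 cut points among the 13 gaps between sorted values gives 715 partitions). What follows is only a rough sketch of that idea in base R. The name best_buckets is made up for illustration and is not a function from any package, and the probabilities are the ones given in the quoted message.

```r
# Win probabilities for the 14 entrants, as listed in the quoted message.
p <- c(0.049, 0.129, 0.043, 0.013, 0.015, 0.040, 0.066,
       0.038, 0.204, 0.022, 0.234, 0.044, 0.068, 0.035)

# Hypothetical helper (not from any package): try every split of the
# sorted values into k contiguous buckets and keep the split with the
# smallest within-bucket sum of squares.
best_buckets <- function(p, k) {
  ord <- order(p)
  s <- p[ord]
  n <- length(s)

  # Within-bucket sum of squares for a vector of bucket sizes.
  wss <- function(sizes) {
    g <- rep(seq_along(sizes), sizes)
    sum((s - ave(s, g))^2)
  }

  # Each choice of k - 1 cut points among the n - 1 gaps is one partition.
  cuts <- combn(n - 1, k - 1)
  cost <- apply(cuts, 2, function(cp) wss(diff(c(0, cp, n))))
  best <- cuts[, which.min(cost)]

  # Map the bucket labels back from sorted order to entrant order.
  bucket <- integer(n)
  bucket[ord] <- rep(seq_len(k), diff(c(0, best, n)))

  data.frame(entrant = seq_along(p), old = p, bucket = bucket,
             new = ave(p, bucket))
}

res <- best_buckets(p, 5)
res
```

Since each bucket is represented by its mean, the bucketed probabilities still sum to the same total as the originals. Restricting the search to contiguous runs of the sorted values loses nothing, because a least-squares partition of values on a line never interleaves buckets.
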
On Wed, Nov 19, 2008 at 10:43 AM, Random Walker <[EMAIL PROTECTED]> wrote:
>
> I have a list of entrants (1-14 in this example) in a competitive event and
> corresponding win probabilities for each entrant.
>
> [(1, 0.049), (2, 0.129), (3, 0.043), (4, 0.013), (5, 0.015), (6,
> 0.040), (7, 0.066), (8, 0.038), (9, 0.204), (10, 0.022), (11, 0.234),
> (12, 0.044), (13, 0.068), (14, 0.035)]
>
> So, of course Sum(ps) = 1.
>
> In order to make some subsequent computations more tractable, I wish to
> cluster entrant win probabilities like so:
>
> [(1, 0.049), (2, 0.121), (3, 0.049), (4, 0.024), (5, 0.024), (6,
> 0.049), (7, 0.072), (8, 0.049), (9, 0.185), (10, 0.024), (11, 0.185),
> (12, 0.049), (13, 0.072), (14, 0.049)]
>
> viz. in this case I have 'bucketed' the entrant numbers against 5
> representative probabilities and in subsequent computations will deem (for
> example) the win probability of 3 to be 0.049, so another way of visualising
> the result is:
>
> [((4, 5, 10), 0.024),
> ((3, 6, 8, 12, 14), 0.049),
> ((7, 13), 0.072),
> ((2), 0.121),
> ((11), 0.185)]
>
> and (3 * 0.024) + (5 * 0.049) + (2 * 0.072) + (1 x 0.121) + (1 x 0.185) ~=
> 1.
>
> My question is: What is the most 'correct' way to cluster these
> probabilities? In my case the problem is not totally unconstrained. I would
> like to specify the number of buckets (probably will always wish to use
> either 5 or 6), so I do not need an algorithm which determines the most
> appropriate number of buckets given some cost function. I just need to know
> for a given number of buckets, which entrants go in which buckets and what
> is the representative probability for each bucket.
>
> The first thing which occurs to me is to sort probabilities into ascending
> order, generate all partitions of the list into (say) 5 buckets, and pick
> the partition which minimises the sum of squared differences from the mean
> of each bucket summed over all buckets. If buckets were not associated with
> probabilities I would do this without a second thought... but I wonder if
> this is the right thing to do here? I'm too statistically naive to know one
> way or the other.
>
> I would appreciate any suggestions re correct approach and also (obviously)
> any tips on how one might go about this in R using canned functions.
>
> Many thanks!
>
> --
> View this message in context:
> http://www.nabble.com/Bucketing-Grouping-Probabilities-tp20582544p20582544.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.