Perfectly explained Ted. One might, at first reflection, consider that simply repeating the values 7 through 12 and sampling (w/o replacement) from among the 18 resulting values, would be similar to just doubling the selection probabilities for 7 through 12 and then sampling. That would clearly not be true though. Jim > > Whereas, if you replace x = c(1,2,3,4,5,6,7,8,9,110,11,12) > with the "weighted" equivalent, doubling up 7-12 as in your > x2 = c(1,2,3,4,5,6,7,7,8,8,9,9,10,10,11,11,12,12), > each of the 18 items now has the same weight as the others, > and the "unweighted" sampling > mean(replicate(1000000,(sample(x2, 3)))) > now gives the mean of the 18 values (7.5); whereas -- as my > calculation showed -- the effect of the "sequential" weighting is > to bias the result slightly downwards (in your example; generally, > in favour of the items with lower weights), since the way weighting > works in sample() is not equivalent to replicating each item "weight" > times. > > The general problem, of sampling without replacement in such a way > that for each item the probability that it is included in the sample > is proportional to a pre-assigned weight ("sampling with probability > proportional to size") is quite tricky and, for certain choices > of weights, impossible. For a glimpse of what's inside the can of > worms, have a look at the reference manual for the 'sampfling' > package, in particular the function samprop(): > > http://www.stats.bris.ac.uk/R/web/packages/sampfling/sampfling.pdf > > Ted. > > On 23-Jul-09 20:56:43, Jim Bouldin wrote: > > > > You are absolutely correct Ted. When no weights are applied it doesn't > > matter if you sample with or without replacement, because the > > probability > > of choosing any particular value is equally distributed among all such. > > But when they're weighted unequally that's not the case. > > > > It is also interesting to note that if the problem is set up slightly > > differently, by say defining the vector x as: > > x = c(1,2,3,4,5,6,7,7,8,8,9,9,10,10,11,11,12,12), effectively giving > > the > > same probability of selection for the 12 integers as before, the same > > problem does not arise, or at least not as severely: > > > >> x2 > > [1] 1 2 3 4 5 6 7 8 9 10 11 12 7 8 9 10 11 12 > > > >> d = mean(replicate(1000000,(sample(x2, 3))));d # (1 million samples > >> from > > x2, of size 3; the mean should be 7.50) > > [1] 7.499233 > > > >> e = mean(replicate(1000000,(sample(x2, 3, replace = TRUE))));e # (1 > > million samples from x2, of size 3; with replacement this time the mean > > should still be 7.50) > > [1] 7.502085 > > > >> d/e > > [1] 0.9996198 > > > > Jim > >> > >> To obtain the result you expected, you would need to explicitly > >> specify "replace=TRUE", since the default for "replace" is FALSE. > >> (though probably what you really intended was sampling without > >> replacement). > >> > >> -- when > >> replace=FALSE, the probability of inclusion of an element is not > >> proportional to its weight in 'prob'. > >> > >> The reason is that elements with higher weights are more likely > >> to be chosen early on. This then knocks that higher weight out > >> of the contest, making it more likely that elements with smaller > >> weights will be sampled subsequently. Hence the mean of the sample > >> will be biased slightly downwards, relative to the weighted mean > >> of the values in x. > >> > >> table(replicate(1000000,(sample(x, 3)))) > >> # 1 2 3 4 5 6 > >> # 250235 250743 249603 250561 249828 249777 > >> # 7 8 9 10 11 12 > >> # 249780 250478 249591 249182 249625 250597 > >> > >> (so all nice equal frequencies) > >> > >> table(replicate(1000000,(sample(x, 3,prob=weights)))) > >> # 1 2 3 4 5 6 > >> # 174873 175398 174196 174445 173240 174110 > >> # 7 8 9 10 11 12 > >> # 325820 326140 325289 325098 325475 325916 > >> > >> Note that the frequencies of the values with weight=2 are a bit > >> less than twice the frequencies of the values with weight=1: > >> > >> (325820+326140+325289+325098+325475+325916)/ > >> (174873+175398+174196+174445+173240+174110) > >> # [1] > >> > >> > >> In fact this is fairly easily caluclated. The possible combinations > >> (in order of sampling) of the two weights, with their probabilities, > >> are: > >> 1s 2s > >> ------------------------------------------- > >> 1 1 1 P = 6/18 * 5/17 * 4/16 3 0 > >> 1 1 2 P = 6/18 * 5/17 * 12/16 2 1 > >> 1 2 1 P = 6/18 * 12/17 * 5/15 2 1 > >> 1 2 2 P = 6/18 * 12/17 * 10/15 1 2 > >> 2 1 1 P = 12/18 * 6/16 * 5/15 2 1 > >> 2 1 2 P = 12/18 * 6/16 * 10/15 1 2 > >> 2 2 1 P = 12/18 * 10/16 * 6/14 1 2 > >> 2 2 2 P = 12/18 * 10/16 * 8/14 0 3 > >> > >> So the expected number of "weight=1" in the sample is > >> > >> 3*(6/18 * 5/17 * 4/16) + 2*(6/18 * 5/17 * 12/16) + > >> 2*(6/18 * 12/17 * 5/15) + 1*(6/18 * 12/17 * 10/15) + > >> 2*(12/18 * 6/16 * 5/15) + 1*(12/18 * 6/16 * 10/15) + > >> 1*(12/18 * 10/16 * 6/14) + 0 > >> = 1.046218 > >> > >> Hence the expected number of "weight=2" in the sample is > >> > >> 3 - 1.046218 = 1.953782 > >> > >> and their ratio 1.953782/1.046218 = 1.867471 > >> > >> Compare this with the value 1.867351 (above) obtained by simulation! > >> > >> Ted. > >> > >> -------------------------------------------------------------------- > >> E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk> > >> Fax-to-email: +44 (0)870 094 0861 > >> Date: 23-Jul-09 Time: 21:05:07 > >> ------------------------------ XFMail ------------------------------ > >> > > > > Jim Bouldin, PhD > > Research Ecologist > > Department of Plant Sciences, UC Davis > > Davis CA, 95616 > > 530-554-1740 > > > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > -------------------------------------------------------------------- > E-Mail: (Ted Harding) <ted.hard...@manchester.ac.uk> > Fax-to-email: +44 (0)870 094 0861 > Date: 23-Jul-09 Time: 23:00:56 > ------------------------------ XFMail ------------------------------ >
Jim Bouldin, PhD Research Ecologist Department of Plant Sciences, UC Davis Davis CA, 95616 530-554-1740 ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.