On Wed, Apr 20, 2011 at 10:09:26PM -0400, gary engstrom wrote:
> Dear Sir,
>
> Please excuse my awkwardness as I am new to R and computers, but would kindly
> appreciate help.
> {
> a <- sample (1:10,100,replace=T )
> b <-sample(10:20,100,replace=T)
> c <- sample(20:30,100,replace=T)
> d <- sample(30:40,100,replace=T)
> e <- sample(40:50,100,replace=T)
> }
> d1 <- a
> d2 <- b
> d3 <-c
> d4 <- d
> d5 <- e
>
> data.frame(d1,d2,d3,d4,d5)
> dd <- data.frame(d1,d2,d3,d4,d5)
> dd
> sd(d1)
> summary(d1)
> sd(d2)
> summary(d2)
> sd(d3)
> summary(d3)
> sd(d4)
> summary(d4)
> sd(d5)
> summary(d5)
> I am a beginner to R and am trying to learn statistical
> probability. I have started Dr. Levine and Dr Kerns books.
> So far from the usual sources, I haven't found the answers
> to the following questions and would greatly appreciate
> any assistance that anyone might kindly share.
> If I run this code, how do I look for duplicate rows and how can

See ?duplicated.

> I adjust the SD of the sample function to make the chances
> of a duplicate row occur more often ?

A simple way to increase the number of duplicated rows is to reduce
the space from which the rows are drawn. The following estimates the
probability of having at least one duplicated row using your original
code.

  m <- 10000
  count <- 0
  for (i in 1:m) {
      d1 <- sample(1:10,100,replace=TRUE)
      d2 <- sample(10:20,100,replace=TRUE)
      d3 <- sample(20:30,100,replace=TRUE)
      d4 <- sample(30:40,100,replace=TRUE)
      d5 <- sample(40:50,100,replace=TRUE)
      dd <- data.frame(d1,d2,d3,d4,d5)
      if (any(duplicated(dd))) {
          count <- count + 1
      }
  }
  count/m

I obtained

  [1] 0.035

This probability may also be computed exactly as follows. The number
of all possible rows from which we sample is the product of the sizes
of the sets from which each component is chosen. This is 10*11^4,
since 1:10 has 10 values and each of the other four ranges has 11.
Using this, the probability of having at least one duplicated row
among 100 rows chosen from the uniform distribution is

  N <- 10*11^4 # the number of all possible rows
  1 - prod(1 - (0:99)/N)

  [1] 0.03325143

If the sample space is reduced to 8^5 using

  d1 <- sample(1:8,100,replace=TRUE)
  d2 <- sample(11:18,100,replace=TRUE)
  d3 <- sample(21:28,100,replace=TRUE)
  d4 <- sample(31:38,100,replace=TRUE)
  d5 <- sample(41:48,100,replace=TRUE)

then the probability of having at least one duplicated row increases to

  N <- 8^5
  1 - prod(1 - (0:99)/N)

  [1] 0.1403373

Hope this helps.

Petr Savicky.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
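
As a follow-up to the pointer to ?duplicated in the reply, here is a
minimal sketch of how the duplicated rows themselves might be listed,
together with the exact formula wrapped in a small function. The names
dd_small, all_copies and p_dup are hypothetical and chosen only for
this illustration; they do not appear in the original post.

```r
# A tiny hand-made data frame (hypothetical), so the result is easy to check.
dd_small <- data.frame(x = c(1, 2, 1, 3),
                       y = c("a", "b", "a", "a"))

# duplicated() marks a row TRUE if an identical row occurred earlier.
dup <- duplicated(dd_small)
dup                # FALSE FALSE  TRUE FALSE
dd_small[dup, ]    # only the repeated copies

# To see every copy of a duplicated row, including the first occurrence,
# combine a forward and a backward scan.
all_copies <- dd_small[duplicated(dd_small) |
                       duplicated(dd_small, fromLast = TRUE), ]
all_copies

# Hypothetical helper: exact probability of at least one duplicated row
# among n rows drawn uniformly and independently from N possible rows.
p_dup <- function(n, N) 1 - prod(1 - (0:(n - 1)) / N)

p_dup(100, 10 * 11^4)   # the original sample space
p_dup(100, 8^5)         # the reduced sample space
```

The two calls to p_dup() repeat the computations in the reply, so they
should print the same values as given there. Applying the same
indexing to the 100-row dd from the question lists its duplicated
rows, if there are any.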