Hi: This seems to take a bit less code, avoids explicit loops (by using mapply() instead, where the loops are internal) and takes about 10 seconds on my system:
m <- cbind(x = sample(1:15, 2000000, replace = TRUE),
           y = sample(1:10 * 1000, 2000000, replace = TRUE))
sum(m[, 1])
# [1] 16005804

ff <- function(x, y) rep(y, x)
system.time(w <- do.call(c, mapply(ff, m[, 1], m[, 2])))
#   user  system elapsed
#   9.75    0.00    9.75

length(w)
# [1] 16005804
count(w)   # from package plyr
#        x    freq
# 1   1000 1603184
# 2   2000 1590599
# 3   3000 1596661
# 4   4000 1607112
# 5   5000 1598571
# 6   6000 1599195
# 7   7000 1600475
# 8   8000 1601718
# 9   9000 1598896
# 10 10000 1609393

HTH,
Dennis

PS: It would have been a good idea to keep the OP in the loop of this thread.

On Thu, Aug 18, 2011 at 12:46 AM, Timothy Bates <timothy.c.ba...@gmail.com> wrote:
> This takes a few seconds to do 1 million lines, and stays in explicit
> for-loop form:
>
> numberofSalaryBands = 1000000  # 2000000
> x = sample(1:15, numberofSalaryBands, replace = T)
> y = sample((1:10) * 1000, numberofSalaryBands, replace = T)
> df = data.frame(x, y)
> finalN = sum(df$x)
> myVar = rep(NA, finalN)
> outIndex = 1
> for (i in 1:numberofSalaryBands) {
>   kount = df$x[i]
>   myVar[outIndex:(outIndex + kount - 1)] = rep(df$y[i], kount)  # make x[i] copies of value y[i]
>   outIndex = outIndex + kount
> }
> head(myVar)
> plyr::count(myVar)
>
> On Aug 18, 2011, at 12:17 AM, Alex Ruiz Euler wrote:
>
>> Dear R community,
>>
>> I have a 2 million by 2 matrix that looks like this:
>>
>> x <- sample(1:15, 2000000, replace = T)
>> y <- sample(1:10 * 1000, 2000000, replace = T)
>>
>>          x     y
>>  [1,]   10  4000
>>  [2,]    3  1000
>>  [3,]    3  4000
>>  [4,]    8  6000
>>  [5,]    2  9000
>>  [6,]    3  8000
>>  [7,]    2 10000
>> (...)
>>
>> The first column is a population expansion factor for the number in
>> the second column (household income). I want to expand the second
>> column by the first, so that I end up with a vector beginning with 10
>> observations of 4000, then 3 observations of 1000, and so on.
>> In my mind the natural approach would be to create a NULL vector and
>> append the expansions:
>>
>> myvar <- NULL
>> myvar <- append(myvar, replicate(x[1], y[1]), 1)
>>
>> for (i in 2:length(x)) {
>>   myvar <- append(myvar, replicate(x[i], y[i]), sum(x[1:i]) + 1)
>> }
>>
>> to end up with a vector of length sum(x), which in my real database
>> corresponds to 22 million observations.
>>
>> This works fine -- if I only run it on the first, say, 1000
>> observations. If I try to perform it on all 2 million observations it
>> takes far too long to be useful (I left it running for 11 hours
>> yesterday, to no avail).
>>
>> I know R performs well with operations on relatively large vectors.
>> Why is this so inefficient? And what would be the smart way to do it?
>>
>> Thanks in advance.
>> Alex
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
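A footnote that may save future readers a step: rep() is itself vectorized over its `times` argument, so the whole expansion can, as far as I can tell, be written as a single call with no mapply() and no explicit loop. A minimal sketch on the same simulated data (variable names follow the original post):

```r
set.seed(1)
x <- sample(1:15, 2000000, replace = TRUE)          # expansion factors
y <- sample(1:10 * 1000, 2000000, replace = TRUE)   # household incomes

# Each y[i] is repeated x[i] times, in order, in one vectorized call.
w <- rep(y, times = x)

length(w) == sum(x)   # TRUE: total length equals the sum of the factors
```

This also suggests why the append() version is so slow: each append() copies the entire vector built so far, so the total copying grows roughly quadratically with the final length, whereas a single rep() (or Timothy's preallocated loop) writes each element once.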