Thanks for the code corrections. I see how for loops, append() and naively
growing a NULL vector can be so resource-consuming. I tried the code with 20
million observations on the following machine:

processor   : 7
cpu family  : 6
model name  : Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz
cpu MHz     : 933.000
cache size  : 6144 KB

First I tried Timothy's code and left it running for half an hour, at which
point I had to interrupt the command:

Timing stopped at: 1033.516 829.147 1845.648

Then Dennis' option:

   user  system elapsed
 25.793   0.224  25.784

And for Paul's option, using a vector of length 20 million, I had to stop at:

Timing stopped at: 850.577 8.868 851.464

Not very efficient for relatively large vectors. I have also read that
wrapping an expression in {} instead of (), e.g. {x+1} rather than (x+1),
runs faster, and that working directly with matrices instead of data frames
is faster too.

Thanks for your input.
Alex
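For the archives: the whole expansion discussed in this thread can be done in
a single vectorised call, since rep() accepts a vector for its times argument
and repeats each element of y the corresponding number of times. This is an
illustration added here, not code quoted from the thread (though it may be
close to what Dennis posted):

# Vectorised expansion: repeat y[i] exactly x[i] times, with no growing vector.
x <- sample(1:15, 2e6, replace = TRUE)
y <- sample(1:10 * 1000, 2e6, replace = TRUE)
system.time(myvar <- rep(y, times = x))
length(myvar) == sum(x)   # TRUE: roughly 16 million values, built in one allocation

Because the result is allocated once, this should finish in seconds rather
than the hours reported above.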
On Fri, 19 Aug 2011 13:58:09 +0000 Paul Hiemstra <paul.hiems...@knmi.nl> wrote:

> As I already stated in my reply to your earlier post:
>
> resending the answer for the archives of the mailing list...
>
> Hi Alex,
>
> The other reply already gave you the R way of doing this while avoiding
> the for loop. However, there is a more general reason why your for loop
> is terribly inefficient. A small set of examples:
>
> largeVector = runif(10e4)
> outputVector = NULL
> system.time(for(i in 1:length(largeVector)) {
>    outputVector = append(outputVector, largeVector[i] + 1)
> })
> #  user  system elapsed
> # 6.591   0.168   6.786
>
> The problem in this code is that outputVector keeps on growing and
> growing. The operating system needs to allocate more and more space as
> the object grows. This process is really slow. Several (much) faster
> alternatives exist:
>
> # Pre-allocating the outputVector
> outputVector = rep(0, length(largeVector))
> system.time(for(i in 1:length(largeVector)) {
>    outputVector[i] = largeVector[i] + 1
> })
> #  user  system elapsed
> # 0.178   0.000   0.178
> # speed up of 37 times, this will only increase for large
> # lengths of largeVector
>
> # Using apply functions
> system.time(outputVector <- sapply(largeVector, function(x) return(x + 1)))
> #  user  system elapsed
> # 0.124   0.000   0.125
> # Even a bit faster
>
> # Using vectorisation
> system.time(outputVector <- largeVector + 1)
> #  user  system elapsed
> # 0.000   0.000   0.001
> # Practically instant, 6780 times faster than the first example
>
> It is not always clear which method is most suitable and which performs
> best. At least they all perform much, much better than the naive option
> of letting outputVector grow.
>
> cheers,
> Paul
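Paul's reallocation point can be made concrete with a quick timing sketch
(added here, not part of the original exchange; absolute numbers will differ
by machine). Because each append() copies the whole vector built so far, the
total work grows roughly quadratically with the length, so doubling the input
should roughly quadruple the elapsed time:

# Growing a vector one element at a time copies it repeatedly,
# so total cost scales roughly quadratically with its length.
grow <- function(n) {
  out <- NULL
  for (i in 1:n) out <- append(out, i)
  out
}
for (n in c(1e4, 2e4, 4e4)) {
  cat("n =", n, "\n")
  print(system.time(grow(n)))
}
# Expect elapsed time to roughly quadruple at each doubling of n,
# while a pre-allocated or vectorised version scales linearly.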
> On 08/17/2011 11:17 PM, Alex Ruiz Euler wrote:
> >
> > Dear R community,
> >
> > I have a 2 million by 2 matrix that looks like this:
> >
> > x <- sample(1:15, 2000000, replace=T)
> > y <- sample(1:10*1000, 2000000, replace=T)
> >
> >         x     y
> > [1,]   10  4000
> > [2,]    3  1000
> > [3,]    3  4000
> > [4,]    8  6000
> > [5,]    2  9000
> > [6,]    3  8000
> > [7,]    2 10000
> > (...)
> >
> > The first column is a population expansion factor for the number in the
> > second column (household income). I want to expand the second column
> > by the first, so that I end up with a vector beginning with 10
> > observations of 4000, then 3 observations of 1000, and so on. In my mind
> > the natural approach would be to create a NULL vector and append the
> > expansions:
> >
> > myvar <- NULL
> > myvar <- append(myvar, replicate(x[1], y[1]), 1)
> >
> > for (i in 2:length(x)) {
> >    myvar <- append(myvar, replicate(x[i], y[i]), sum(x[1:i]) + 1)
> > }
> >
> > to end up with a vector of length sum(x), which in my real database
> > corresponds to 22 million observations.
> >
> > This works fine -- if I only run it for the first, say, 1000
> > observations. If I try to perform it on all 2 million observations it
> > takes far too long to be useful (I left it running for 11 hours
> > yesterday, to no avail).
> >
> > I know R performs well with operations on relatively large vectors. Why
> > is this so inefficient? And what would be the smart way to do this?
> >
> > Thanks in advance.
> > Alex
>
> --
> Paul Hiemstra, Ph.D.
> Global Climate Division
> Royal Netherlands Meteorological Institute (KNMI)
> Wilhelminalaan 10 | 3732 GK | De Bilt | Kamer B 3.39
> P.O. Box 201 | 3730 AE | De Bilt
> tel: +31 30 2206 494
>
> http://intamap.geo.uu.nl/~paul
> http://nl.linkedin.com/pub/paul-hiemstra/20/30b/770
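For completeness, here is a sketch (added for this write-up, not taken from
the thread) of Paul's pre-allocation advice applied to the original expansion
problem. rep() remains the simplest and fastest answer, but even an explicit
loop becomes tractable once the output is allocated up front and filled block
by block. The helper names starts and ends are hypothetical, chosen for this
illustration:

# Pre-allocate the full result, then fill contiguous blocks of it.
x <- sample(1:15, 2e6, replace = TRUE)
y <- sample(1:10 * 1000, 2e6, replace = TRUE)

ends   <- cumsum(x)          # last index of each block in the result
starts <- ends - x + 1       # first index of each block
myvar  <- numeric(sum(x))    # allocated once, never grows
for (i in seq_along(x)) {
  myvar[starts[i]:ends[i]] <- y[i]
}
identical(myvar, rep(y, times = x))   # TRUE

This still loops 2 million times, so it will be slower than rep(), but it
avoids the quadratic copying that made the append() version run for hours.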