Hi William,

I tested plyr's dlply function, and it seems to have O(N*log(R)) complexity. I only tested the case R=N, so I cannot tell whether N is the number of rows or the number of categories.
For the data.frame example with 2e5 rows and 2e5 categories it is approx. 10 times faster than split. Still, it takes 10 seconds on an Intel i7-5930K at 3.5 GHz.

It would be nice if the documentation contained runtime complexity information, and if the documentation of base package functions pointed to the functions that should be used instead.

Thanks

On 29 June 2016 at 16:13, William Dunlap <wdun...@tibco.com> wrote:
> I won't go into why splitting data.frames (or factors) uses time
> proportional to the number of input rows times the number of
> levels in the splitting factor, but you will get much better mileage
> if you call split individually on each 'atomic' (numeric, character, ...)
> variable and use mapply on the resulting lists.
>
> The plyr and dplyr packages were developed to deal with this
> sort of problem. Check them out.
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Wed, Jun 29, 2016 at 6:21 AM, Witold E Wolski <wewol...@gmail.com> wrote:
>>
>> Hi,
>>
>> Here is a complete example which shows that the complexity of split or
>> by is O(n^2)
>>
>> nrows <- c(1e3,5e3, 1e4 ,5e4, 1e5 ,2e5)
>> res<-list()
>>
>> for(i in nrows){
>>   dum <- data.frame(x = runif(i,1,1000), y=runif(i,1,1000))
>>   res[[length(res)+1]]<-(system.time(x<- split(dum, 1:nrow(dum))))
>> }
>> res <- do.call("rbind",res)
>> plot(nrows^2, res[,"elapsed"])
>>
>> And I can't see a reason why this has to be so slow.
>>
>>
>> cheers
>>
>>
>> On 29 June 2016 at 12:00, Rolf Turner <r.tur...@auckland.ac.nz> wrote:
>> > On 29/06/16 21:16, Witold E Wolski wrote:
>> >>
>> >> It's the inverse problem to merging a list of data.frames into a large
>> >> data.frame just discussed in the "performance of do.call("rbind")"
>> >> thread
>> >>
>> >> I would like to split a data.frame into a list of data.frames
>> >> according to first column.
>> >> This SEEMS to be easily possible with the function base::by. However,
>> >> as soon as the data.frame has a few million rows this function CAN NOT
>> >> BE USED (unless you have PLENTY OF TIME).
>> >>
>> >> for 'by' runtime ~ nrow^2, or formally O(n^2) (see benchmark below).
>> >>
>> >> So basically I am looking for a similar function with better
>> >> complexity.
>> >>
>> >>
>> >> > nrows <- c(1e5,1e6,2e6,3e6,5e6)
>> >> >
>> >> > timing <- list()
>> >> > for(i in nrows){
>> >> +   dum <- peaks[1:i,]
>> >> +   timing[[length(timing)+1]] <- system.time(x<- by(dum[,2:3],
>> >> +     INDICES=list(dum[,1]), FUN=function(x){x}, simplify = FALSE))
>> >> + }
>> >> >
>> >> > names(timing)<- nrows
>> >> > timing
>> >>
>> >> $`1e+05`
>> >>    user  system elapsed
>> >>    0.05    0.00    0.05
>> >>
>> >> $`1e+06`
>> >>    user  system elapsed
>> >>    1.48    2.98    4.46
>> >>
>> >> $`2e+06`
>> >>    user  system elapsed
>> >>    7.25   11.39   18.65
>> >>
>> >> $`3e+06`
>> >>    user  system elapsed
>> >>   16.15   25.81   41.99
>> >>
>> >> $`5e+06`
>> >>    user  system elapsed
>> >>   43.22   74.72  118.09
>> >
>> >
>> > I'm not sure that I follow what you're doing, and your example is not
>> > reproducible, since we have no idea what "peaks" is, but on a toy
>> > example with 5e6 rows in the data frame I got a timing result of
>> >
>> >    user  system elapsed
>> >   0.379   0.025   0.406
>> >
>> > when I applied split(). Is this adequately fast? Seems to me that if
>> > you want to split something, split() would be a good place to start.
>> >
>> > cheers,
>> >
>> > Rolf Turner
>> >
>> > --
>> > Technical Editor ANZJS
>> > Department of Statistics
>> > University of Auckland
>> > Phone: +64-9-373-7599 ext. 88276
>>
>>
>>
>> --
>> Witold Eryk Wolski
>>
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
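
For reference, Bill's suggestion quoted above (call split on each atomic column separately, then recombine the pieces with mapply) could look roughly like the sketch below. The helper name split_df_by_column is made up for illustration and is not part of base R or plyr. The sketch assumes every column is an atomic vector and that no column is named FUN, MoreArgs, SIMPLIFY or USE.NAMES. I have not benchmarked it here, and the data.frame() call per group carries its own overhead, so returning plain lists instead may be cheaper when there are very many groups.

```r
# Hypothetical helper (not from any package): split a data.frame by
# splitting each atomic column on its own, then stitching the pieces
# back together group by group.
split_df_by_column <- function(df, f) {
  f <- as.factor(f)
  # split() on an atomic vector avoids the per-group data.frame
  # subsetting that split.data.frame() performs.
  pieces <- lapply(df, split, f)
  # mapply() walks the per-column lists in parallel: the i-th call
  # receives the i-th group's chunk of every column, named by column.
  build <- function(...) data.frame(..., stringsAsFactors = FALSE)
  do.call(mapply, c(list(FUN = build), pieces, list(SIMPLIFY = FALSE)))
}

# Small usage example in the spirit of the benchmark above
dum <- data.frame(x = runif(10, 1, 1000), y = runif(10, 1, 1000))
parts <- split_df_by_column(dum, rep(1:5, 2))
str(parts[[1]])
```

Note that, unlike split(dum, f), the resulting data.frames get fresh row names (1, 2, ...) rather than the original ones.
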

--
Witold Eryk Wolski