On Wed, Dec 9, 2009 at 11:20 AM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:
> On Wed, 9 Dec 2009, Peng Yu wrote:
>
>> On Tue, Dec 8, 2009 at 11:06 PM, David Winsemius <dwinsem...@comcast.net>
>> wrote:
>>>
>>> On Dec 9, 2009, at 12:00 AM, Peng Yu wrote:
>>>
>>>> On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius
>>>> <dwinsem...@comcast.net>
>>>> wrote:
>>>>>
>>>>> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>>>>>
>>>>>> I have the following code, which tests the split on a data.frame and
>>>>>> the split on each column (as a vector) separately. The runtimes
>>>>>> differ by a factor of 10. When m and k increase, the difference
>>>>>> becomes even bigger.
>>>>>>
>>>>>> I'm wondering why the performance on data.frame is so bad. Is it a bug
>>>>>> in R? Can it be improved?
>>>>>
>>>>> You might want to look at the data.table package. The author claims
>>>>> significant speed improvements over data.frames.
>>>>
>>>> This bug was found a long time ago and a package has been
>>>> developed for it. Should the fix be integrated into data.frame rather
>>>> than be implemented in an additional package?
>>>
>>> What bug?
>>
>> Is the slow speed in splitting a data.frame a performance bug?
>>
>
> NO!
>
> The two computations are not equivalent.
>
> One is a list whose elements are split vectors, and the other is a list of
> data.frames containing those vectors.

I made a comparable example below. Splitting a data.frame is still much
slower than the second approach that I show.

> If you take the trouble to assemble that list of data frames from the list
> of split vectors you will see that it is very time consuming.

It is not, as I show in the example below.

> Read up on memory management issues. Think about what the computer actually
> has to do in terms of memory access to split a data.frame versus split a
> vector.

I'd like to read more on how R does memory management. Would you please
point me to a good source?

But again, R is not user friendly. It took me quite a long time to figure
out that splitting a data.frame is a bottleneck in my program and to reduce
the problem to a test case.

I don't know how memory management is done in R, so I don't know whether
it is possible to fix the problem of splitting a data.frame without
perturbing the interface of data.frame. But if splitting a data.frame is
so slow, maybe it could be forbidden and an alternative documented
somewhere.

> ---
>
> And even if it were simply a matter of having code that is slow for some
> application, that would not be a bug. Read the FAQ!

The definition of a bug in the FAQ is narrower than I thought. Whatever the
definition of a bug is, split() on a data.frame is a perfectly legitimate
operation (in terms of the interface). A quick fix for this problem is to at
least single out the case where the argument is a data.frame and to do what
I have been doing below. That is why I say this is a performance bug.
Similar cases, where a faster alternative exists but is not used, are
rightly called bugs, at least in many other languages.
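
The workaround described above (split every column as a plain vector, then
bind the pieces of each group back together) can be sketched as a small
function. This is only a sketch under assumptions: the name split_by_column
is hypothetical and not part of R, all columns are assumed to be atomic
vectors of one type, and the result is a list of matrices rather than a
list of data.frames, so it is not a drop-in replacement for split() on a
data.frame.

```r
# Hypothetical helper, not part of base R: split each column as a plain
# vector, then cbind the per-group pieces into one matrix per group.
split_by_column <- function(df, f) {
  # One entry per column; each entry is a list of that column's values by group.
  pieces <- lapply(df, split, f)
  # For group i, pick the i-th piece of every column and bind them side by side.
  out <- lapply(seq_along(pieces[[1]]), function(i) {
    do.call(cbind, lapply(pieces, `[[`, i))
  })
  names(out) <- names(pieces[[1]])
  out
}

# Small-scale check that the values agree with split() on the data.frame.
set.seed(0)
df <- as.data.frame(replicate(3, rnorm(20)))
f <- sample(1:4, size = 20, replace = TRUE)

fast <- split_by_column(df, f)   # list of matrices
slow <- split(df, f)             # list of data.frames

same_values <- vapply(names(slow), function(g) {
  isTRUE(all.equal(unname(fast[[g]]), unname(as.matrix(slow[[g]]))))
}, logical(1))
print(all(same_values))
```

The original row names are lost because each piece is a matrix; a caller
who needs data.frames would have to convert each piece, which adds cost
that this sketch does not measure.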

> m=300000
> n=6
> k=30000
>
> set.seed(0)
> x=replicate(n,rnorm(m))
> f=sample(1:k, size=m, replace=T)
>
> system.time(split(as.data.frame(x),f))
   user  system elapsed
 39.020   0.010  39.084
>
> v=lapply(
+ 1:dim(x)[[2]]
+ , function(i) {
+ split(x[,i],f)
+ }
+ )
>
> system.time(lapply(
+ 1:dim(x)[[2]]
+ , function(i) {
+ split(x[,i],f)
+ }
+ )
+ )
   user  system elapsed
  2.520   0.000   2.526
>
> system.time(
+ mapply(
+ function(...) {
+ cbind(...)
+ }
+ , v[[1]], v[[2]], v[[3]], v[[4]], v[[5]], v[[6]]
+ )
+ )
   user  system elapsed
  0.920   0.000   0.927
>

>>>>
>>>>> David.
>>>>>>
>>>>>>> system.time(split(as.data.frame(x),f))
>>>>>>
>>>>>>    user  system elapsed
>>>>>>   1.700   0.010   1.786
>>>>>>>
>>>>>>> system.time(lapply(
>>>>>>
>>>>>> + 1:dim(x)[[2]]
>>>>>> + , function(i) {
>>>>>> + split(x[,i],f)
>>>>>> + }
>>>>>> + )
>>>>>> + )
>>>>>>    user  system elapsed
>>>>>>   0.170   0.000   0.167
>>>>>>
>>>>>> ###########
>>>>>> m=30000
>>>>>> n=6
>>>>>> k=3000
>>>>>>
>>>>>> set.seed(0)
>>>>>> x=replicate(n,rnorm(m))
>>>>>> f=sample(1:k, size=m, replace=T)
>>>>>>
>>>>>> system.time(split(as.data.frame(x),f))
>>>>>>
>>>>>> system.time(lapply(
>>>>>> 1:dim(x)[[2]]
>>>>>> , function(i) {
>>>>>> split(x[,i],f)
>>>>>> }
>>>>>> )
>>>>>> )
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help@r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>> David Winsemius, MD
>>>>> Heritage Laboratories
>>>>> West Hartford, CT
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cbe...@tajo.ucsd.edu               UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901