On Tue, Dec 8, 2009 at 10:37 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>
> On Dec 8, 2009, at 11:28 PM, Peng Yu wrote:
>
>> I have the following code, which tests the split on a data.frame and
>> the split on each column (as a vector) separately. The runtimes differ
>> by a factor of 10. When m and k increase, the difference becomes even
>> bigger.
>>
>> I'm wondering why the performance on data.frame is so bad. Is it a bug
>> in R? Can it be improved?
>
> You might want to look at the data.table package. The author claims
> significant speed improvements over data.frames.

This bug was found a long time ago, and a package has been developed for
it. Should the fix be integrated into data.frame rather than implemented
in an additional package?

> David.
>>
>> > system.time(split(as.data.frame(x),f))
>>    user  system elapsed
>>   1.700   0.010   1.786
>> > system.time(lapply(
>> + 1:dim(x)[[2]]
>> + , function(i) {
>> + split(x[,i],f)
>> + }
>> + )
>> + )
>>    user  system elapsed
>>   0.170   0.000   0.167
>>
>> ###########
>> m=30000
>> n=6
>> k=3000
>>
>> set.seed(0)
>> x=replicate(n,rnorm(m))
>> f=sample(1:k, size=m, replace=T)
>>
>> system.time(split(as.data.frame(x),f))
>>
>> system.time(lapply(
>> 1:dim(x)[[2]]
>> , function(i) {
>> split(x[,i],f)
>> }
>> )
>> )
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
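
For anyone comparing the two timings quoted above, here is a minimal base-R sketch of one possible workaround. It is not a fix to data.frame itself. It assumes the cost comes mostly from building one small data.frame per group (split on a data.frame appears to subset the whole data.frame once for each level of f), so it splits the row indices and the individual columns as plain vectors and only builds a per-group data.frame when one is actually requested. The names df, idx, cols and get_group are made up for this illustration.

```r
# Same setup as in the quoted message.
m <- 30000
n <- 6
k <- 3000

set.seed(0)
x <- replicate(n, rnorm(m))
f <- sample(1:k, size = m, replace = TRUE)
df <- as.data.frame(x)

# Split the row indices once; this is an ordinary vector split.
idx <- split(seq_len(m), f)

# Split every column as a plain vector, as in the faster lapply() variant.
cols <- lapply(df, function(col) split(col, f))

# Hypothetical helper: build the data.frame for a single group on demand
# instead of materialising all groups up front.
get_group <- function(g) df[idx[[g]], , drop = FALSE]

# Sanity check against split() on the data.frame for the first group present.
g <- names(idx)[1]
same <- identical(get_group(g), split(df, f)[[g]])
print(same)
```

Whether this helps depends on how many groups are needed afterwards; if every group is eventually turned into its own data.frame, the total work is likely similar to calling split on the data.frame directly.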