I'm sorry, Rob, but that code is dense enough and formatted badly enough that it's hard to dig through.
You may want to try the data.table package. The development version on R-forge is pretty fast for grouping operations like this. I'm not sure if this is what you're really after. It's hard to tell from your example. Compare some speeds: > dat <- data.frame(D=sample(32000:33000, 666000,T), + Fid=sample(1:10,666000,T), + A=sample(1:5,666000,T)) > > ####### one of your examples > system.time(ret <- fedb.ddplyWrapper2(dat, c("D", "Fid"), + function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)))) user system elapsed 21.78 14.42 36.35 > > > ####### data.table > install.packages("data.table",repos="http://R-Forge.R-project.org") > library(data.table) > dt <- as.data.table(dat) > system.time(ret2 <- dt[, sum(A, na.rm=T), by = "D,Fid"]) user system elapsed 0.27 0.00 0.28 > > > ####### plyr for comparison, too > library(plyr) > system.time(ret3 <- ddply(dat, .(D,Fid), function(x) sum(x$A, na.rm=T))) user system elapsed 28.94 12.16 41.23 > head(ret) [,1] [,2] 1 175 175 2 222 222 3 221 221 4 134 134 5 253 253 6 194 194 > head(ret2) D Fid V1 [1,] 32000 1 228 [2,] 32000 2 209 [3,] 32000 3 182 [4,] 32000 4 180 [5,] 32000 5 181 [6,] 32000 6 222 > head(ret3) D Fid V1 1 32000 1 175 2 32000 2 222 3 32000 3 221 4 32000 4 134 5 32000 5 253 6 32000 6 194 - Tom On Fri, Feb 26, 2010 at 2:58 PM, Rob Forler <rfor...@uchicago.edu> wrote: > So I have a function that does lapply's for me based on dimension. Currently > only works for length(pivotColumns)=2 because I haven't fixed the rbinds. I > have two versions. One runs WAYYY faster than the other. And I'm not sure > why. > > Fast Version: > > fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions, > ...){ > lapplyFunctionRecurse <- function(cdata, level=1, ...){ > if(level==1){ > > return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), > function(x) lapplyFunctionRecurse(x, level+1, ...))) > } else if (level==length(pivotColumns)) { > # > return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T), > function(x, ...) listNameFunctions(data[x,], ...))) > return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]], > data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T), > sum(data[cdata,"A"], na.rm=T)))) > } else { > return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) > } > } > result = lapplyFunctionRecurse(data, ...) > matrix2 <- do.call('rbind', lapply(result, function(x) > do.call('rbind',x))) > return(matrix2) > } > > > dat <- data.frame(D=sample(32000:33000, 666000, > T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T)) >> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"), > function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); > proc.time()-temp > user system elapsed > 4.616 0.006 4.630 > #note in thie case the anonymous function I pass in isn't used because I > hardcode the function into the lapply. > > approx 4 seconds > > This runs very fast. This runs very slow: > > fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){ > lapplyFunctionRecurse <- function(cdata, level=1, ...){ > if(level==1){ > > return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), > function(x) lapplyFunctionRecurse(x, level+1, ...))) > } else if (level==length(pivotColumns)) { > #this line is different. it essentially calls the function you > pass in > return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x, ...) listNameFunctions(data[x,], ...))) > } else { > return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) > } > } > result = lapplyFunctionRecurse(data, ...) > matrix2 <- do.call('rbind', lapply(result, function(x) > do.call('rbind',x))) > return(matrix2) > } > > dat <- data.frame(D=sample(32000:33000, 666000, > T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T)) >> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"), > function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); > proc.time()-temp > user system elapsed > 16.346 65.059 81.680 > > head(ret3) D Fid V1 1 32000 1 175 2 32000 2 222 3 32000 3 221 4 32000 4 134 5 32000 5 253 6 32000 6 194 > > > Can anyone explain to me why there is a 4x time difference? I don't want to > have to hardcore into the recursion function, but if I have to I will. > > Thanks, > Rob > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.