On my computer your two examples seem to execute about the same: > fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions, + ...){ + lapplyFunctionRecurse <- function(cdata, level=1, ...){ + if(level==1){ + + return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), + function(x) lapplyFunctionRecurse(x, level+1, ...))) + } else if (level==length(pivotColumns)) { + # + return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T), + function(x, ...) listNameFunctions(data[x,], ...))) + return(lapply(split(cdata,data[cdata,pivotColumns[level]], + drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]], + data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T), + sum(data[cdata,"A"], na.rm=T)))) + } else { + return(lapply(split(cdata,data[cdata,pivotColumns[level]], + drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) + } + } + result = lapplyFunctionRecurse(data, ...) + matrix2 <- do.call('rbind', lapply(result, function(x) + do.call('rbind',x))) + return(matrix2) + } > > Rprof() > dat <- data.frame(D=sample(32000:33000, 666000, + T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T)) > temp = proc.time(); ret = fedb.ddplyWrapper2Fast(dat, c("D", "Fid"), + function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); > proc.time()-temp user system elapsed 23.44 7.37 30.86
> fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){ + lapplyFunctionRecurse <- function(cdata, level=1, ...){ + if(level==1){ + + return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), + function(x) lapplyFunctionRecurse(x, level+1, ...))) + } else if (level==length(pivotColumns)) { + #this line is different. it essentially calls the function you pass in + return(lapply(split(cdata,data[cdata,pivotColumns[level]], + drop=T), function(x, ...) listNameFunctions(data[x,], ...))) + } else { + return(lapply(split(cdata,data[cdata,pivotColumns[level]], + drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) + } + } + result = lapplyFunctionRecurse(data, ...) + matrix2 <- do.call('rbind', lapply(result, function(x) + do.call('rbind',x))) + return(matrix2) + } > > dat <- data.frame(D=sample(32000:33000, 666000, + T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T)) > temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"), + function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); > proc.time()-temp user system elapsed 24.06 7.38 31.50 If you run Rprof, most of the time is being spent accessing the dataframe. I would suggest that you convert the dataframe to a matrix to get better performance. Here is what I saw in the Rprof of the first example: 0 19.9 root 1. 19.7 fedb.ddplyWrapper2Fast 2. . 19.7 lapplyFunctionRecurse 3. . . 19.7 lapply 4. . . . 19.4 FUN 5. . . . . 19.4 lapplyFunctionRecurse 6. . . . . . 19.3 lapply 7. . . . . . . 18.6 FUN 8. . . . . . . . 18.6 listNameFunctions 9. . . . . . . . . 18.5 [ 10. . . . . . . . . . 18.3 [.data.frame <<- most of the time in accessing the data within a data frame. 11. . . . . . . . . . . 14.6 attr 11. . . . . . . . . . . 0.5 %in% 12. . . . . . . . . . . . 0.4 match 13. . . . . . . . . . . . . 0.4 is.factor 14. . . . . . . . . . . . . . 0.3 inherits 11. . . . . . . . . . . 0.5 [[ 12. . . . . . . . . . . . 0.5 [[.data.frame 13. . . . . . . . . . . . . 0.2 %in% 14. . . . . . . . . . . . . . 0.2 match 15. . . . . . . . . . . . . . . 0.1 is.factor 16. . . . . . . . . . . . . . . . 0.1 inherits 11. . . . . . . . . . . 0.4 anyDuplicated 12. . . . . . . . . . . . 0.2 anyDuplicated.default 11. . . . . . . . . . . 0.2 names 12. . . . . . . . . . . . 0.2 names 11. . . . . . . . . . . 0.1 vector 12. . . . . . . . . . . . 0.1 length 13. . . . . . . . . . . . . 0.1 length 7. . . . . . . 0.7 is.vector 8. . . . . . . . 0.7 split 9. . . . . . . . . 0.6 split.default 10. . . . . . . . . . 0.5 factor 11. . . . . . . . . . . 0.2 as.character 11. . . . . . . . . . . 0.1 unique 12. . . . . . . . . . . . 0.1 unique.default 10. . . . . . . . . . 0.2 [ 11. . . . . . . . . . . 0.1 [.data.frame 4. . . . 0.4 is.vector 5. . . . . 0.4 split 6. . . . . . 0.4 split.default 7. . . . . . . 0.4 factor 8. . . . . . . . 0.3 as.character 1. 0.1 data.frame On Fri, Feb 26, 2010 at 2:58 PM, Rob Forler <rfor...@uchicago.edu> wrote: > So I have a function that does lapply's for me based on dimension. Currently > only works for length(pivotColumns)=2 because I haven't fixed the rbinds. I > have two versions. One runs WAYYY faster than the other. And I'm not sure > why. > > Fast Version: > > fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions, > ...){ > lapplyFunctionRecurse <- function(cdata, level=1, ...){ > if(level==1){ > > return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), > function(x) lapplyFunctionRecurse(x, level+1, ...))) > } else if (level==length(pivotColumns)) { > # > return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T), > function(x, ...) listNameFunctions(data[x,], ...))) > return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]], > data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T), > sum(data[cdata,"A"], na.rm=T)))) > } else { > return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) > } > } > result = lapplyFunctionRecurse(data, ...) > matrix2 <- do.call('rbind', lapply(result, function(x) > do.call('rbind',x))) > return(matrix2) > } > > > dat <- data.frame(D=sample(32000:33000, 666000, > T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T)) >> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"), > function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); > proc.time()-temp > user system elapsed > 4.616 0.006 4.630 > #note in thie case the anonymous function I pass in isn't used because I > hardcode the function into the lapply. > > approx 4 seconds > > This runs very fast. This runs very slow: > > fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){ > lapplyFunctionRecurse <- function(cdata, level=1, ...){ > if(level==1){ > > return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T), > function(x) lapplyFunctionRecurse(x, level+1, ...))) > } else if (level==length(pivotColumns)) { > #this line is different. it essentially calls the function you > pass in > return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x, ...) listNameFunctions(data[x,], ...))) > } else { > return(lapply(split(cdata,data[cdata,pivotColumns[level]], > drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...))) > } > } > result = lapplyFunctionRecurse(data, ...) > matrix2 <- do.call('rbind', lapply(result, function(x) > do.call('rbind',x))) > return(matrix2) > } > > dat <- data.frame(D=sample(32000:33000, 666000, > T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T)) >> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"), > function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); > proc.time()-temp > user system elapsed > 16.346 65.059 81.680 > > > > Can anyone explain to me why there is a 4x time difference? I don't want to > have to hardcore into the recursion function, but if I have to I will. > > Thanks, > Rob > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve? ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.