> On 17 Sep 2015, at 01:42, Dénes Tóth <toth.de...@ttk.mta.hu> wrote:
>
> On 09/16/2015 04:41 PM, Bert Gunter wrote:
>> Yes! Chuck's use of mapply is exactly the split/combine strategy I was
>> looking for. In retrospect, exactly how one should think about it.
>> Many thanks to all for a constructive discussion.
>>
>> -- Bert
>>
>> Bert Gunter
>>
>>>>> Use mapply like this on large problems:
>>>>>
>>>>> unsplit(
>>>>>   mapply(
>>>>>     function(x,z) eval( x, list( y=z )),
>>>>>     expression( A=y*2, B=y+3, C=sqrt(y) ),
>>>>>     split( dat$Flow, dat$ASB ),
>>>>>     SIMPLIFY=FALSE),
>>>>>   dat$ASB)
>>>>>
>>>>> Chuck
>
> Is there any reason not to use data.table for this purpose, especially if
> efficiency is of concern?
>
> ---
>
> # load data.table and microbenchmark
> library(data.table)
> library(microbenchmark)
> #
> # prepare data
> DF <- data.frame(
>     ASB = rep_len(factor(LETTERS[1:3]), 3e5),
>     Flow = rnorm(3e5)^2)
> DT <- as.data.table(DF)
> DT[, ASB := as.character(ASB)]
> #
> # define functions
> #
> # Chuck's version
> fnSplit <- function(dat) {
>     unsplit(
>         mapply(
>             function(x,z) eval( x, list( y=z )),
>             expression( A=y*2, B=y+3, C=sqrt(y) ),
>             split( dat$Flow, dat$ASB ),
>             SIMPLIFY=FALSE),
>         dat$ASB)
> }
> #
> # data.table-way (IMHO, much easier to read)
> fnDataTable <- function(dat) {
>     dat[,
>         result :=
>             if (.BY == "A") {
>                 2 * Flow
>             } else if (.BY == "B") {
>                 3 + Flow
>             } else if (.BY == "C") {
>                 sqrt(Flow)
>             },
>         by = ASB]
> }
> #
> # benchmark
> #
> microbenchmark(fnSplit(DF), fnDataTable(DT))
> identical(fnSplit(DF), fnDataTable(DT)[, result])
>
> ---
>
> Actually, in Chuck's version the unsplit() part is slow. If the order is not
> of concern (e.g., DF is reordered before calling fnSplit), fnSplit is
> comparable to the DT-version.
>
But David’s version is faster than Chuck’s fnSplit. I modified David’s
solution slightly so that it returns a result identical to fnSplit.

# David's version
# my modification to return a vector just like fnSplit
fnDavid <- function(dat) {
    z <- mapply(
        function(x,z) eval( x, list( y=z )),
        expression( A=y*2, B=y+3, C=sqrt(y) ),
        split( dat$Flow, dat$ASB ),
        USE.NAMES=FALSE,
        SIMPLIFY=TRUE
    )
    as.vector(t(z))
}

I added this to Dénes's code, then benchmarked with the R package rbenchmark
and tested the results like this:

library(rbenchmark)
benchmark(fnSplit(DF), fnDataTable(DT), fnDavid(DF))
identical(fnSplit(DF), fnDataTable(DT)[, result])
identical(fnSplit(DF), fnDavid(DF))

This gave:

             test replications elapsed relative user.self sys.self user.child sys.child
2 fnDataTable(DT)          100   0.829    1.000     0.762    0.066          0         0
3     fnDavid(DF)          100   1.615    1.948     1.515    0.098          0         0
1     fnSplit(DF)          100   2.878    3.472     2.685    0.190          0         0

> identical(fnSplit(DF), fnDataTable(DT)[, result])
[1] TRUE
> identical(fnSplit(DF), fnDavid(DF))
[1] TRUE

So on this data fnDavid takes about 1.9 times as long as the data.table
version, and fnSplit about 3.5 times as long.

Berend

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
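A caveat on fnDavid that may be worth checking: as.vector(t(z)) reads the result matrix row by row, so it seems to reproduce the original row order only when the groups have equal size and ASB cycles A, B, C, as it does in the benchmark data. The sketch below copies fnSplit and fnDavid from the thread and compares them on a small made-up data frame, once interleaved and once sorted by group. The toy values and the names `interleaved`, `sorted`, `same_interleaved` and `same_sorted` are illustrative assumptions, not part of the original posts.

```r
# Chuck's and Berend's functions, copied from the thread
fnSplit <- function(dat) {
  unsplit(
    mapply(
      function(x, z) eval(x, list(y = z)),
      expression(A = y * 2, B = y + 3, C = sqrt(y)),
      split(dat$Flow, dat$ASB),
      SIMPLIFY = FALSE),
    dat$ASB)
}

fnDavid <- function(dat) {
  z <- mapply(
    function(x, z) eval(x, list(y = z)),
    expression(A = y * 2, B = y + 3, C = sqrt(y)),
    split(dat$Flow, dat$ASB),
    USE.NAMES = FALSE, SIMPLIFY = TRUE)
  as.vector(t(z))
}

# Made-up toy data: equal group sizes, rows cycling A, B, C
interleaved <- data.frame(
  ASB  = factor(rep(c("A", "B", "C"), times = 2)),
  Flow = c(1, 4, 9, 16, 25, 36))

# Same rows, but sorted by group (A, A, B, B, C, C)
sorted <- interleaved[order(interleaved$ASB), ]

# Compare the two functions on each layout
same_interleaved <- identical(fnSplit(interleaved), fnDavid(interleaved))
same_sorted      <- identical(fnSplit(sorted), fnDavid(sorted))
print(c(interleaved = same_interleaved, sorted = same_sorted))
```

If the two layouts disagree, the speed advantage of fnDavid over fnSplit applies only to data laid out like DF; for arbitrary row order, unsplit() (or the data.table version) is the safer choice.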