The following avoids the overhead of the data.frame methods (it assumes the data.frame doesn't include matrices or other data.frames) and relies on split(vector, factor) quickly splitting a vector into a list of vectors. For a 10^6 row by 10 column data.frame split into 10^5 groups this took 14.1 seconds, while split() took 658.7 seconds. Both returned the same thing.
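Before the full function, a toy illustration of the building block may help. This is only a sketch: the names df, g, cols and grp2 are made up for the example and do not come from the thread. Each column of a data.frame is an ordinary vector, so split(vector, factor) can be applied column by column without dispatching any data.frame method.

```r
# Toy data; df and g are illustrative names, not from the original post.
df <- data.frame(key = c(1, 1, 2, 2, 3), val = c(10, 20, 30, 40, 50))
g  <- as.factor(df$key)

# Split every column as a plain vector: for each column we get
# a list with one vector per level of g.
cols <- lapply(df, function(xi) split(xi, g))

# Collect the pieces that belong to group "2" across all columns.
grp2 <- lapply(cols, `[[`, "2")
str(grp2)
```

What remains is bookkeeping: regrouping these per-column pieces by group and putting names, row names and the "data.frame" class back on each group.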
Perhaps something based on this idea would help your parallelized by().

mysplit.data.frame <- function (x, f, drop = FALSE, ...)
{
    f <- as.factor(f)
    # split each column (a plain vector) by f
    tmp <- lapply(x, function(xi) split(xi, f, drop = drop, ...))
    # split the row names the same way
    rn <- split(rownames(x), f, drop = drop, ...)
    # flatten to one list of column pieces, named by group
    tmp <- unlist(unname(tmp), recursive = FALSE)
    # regroup the pieces by group, keeping the original group order
    tmp <- split(tmp, factor(names(tmp), levels = unique(names(tmp))))
    # turn each group's list of columns back into a data.frame
    tmp <- lapply(setNames(seq_along(tmp), names(tmp)), function(i) {
        t <- tmp[[i]]
        names(t) <- names(x)
        attr(t, "row.names") <- rn[[i]]
        class(t) <- "data.frame"
        t
    })
    tmp
}

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
> Behalf Of Jim Holtman
> Sent: Monday, October 10, 2011 7:29 PM
> To: ivo welch
> Cc: r-help
> Subject: Re: [R] SLOW split() function
>
> instead of splitting the entire dataframe, split the indices and then use
> these to access your data:
> try
>
> system.time(s <- split(seq(nrow(d)), d$key))
>
> this should be faster and less memory intensive. you can then use the
> indices to access the subset:
>
> result <- lapply(s, function(.indx){
>     doSomething <- sum(d$someCol[.indx])
> })
>
> Sent from my iPad
>
> On Oct 10, 2011, at 21:01, ivo welch <ivo.we...@gmail.com> wrote:
>
> > dear R experts: apologies for all my speed and memory questions. I
> > have a bet with my coauthors that I can make R reasonably efficient
> > through R-appropriate programming techniques. this is not just for
> > kicks, but for work. for benchmarking, my [3 year old] Mac Pro has
> > 2.8GHz Xeons, 16GB of RAM, and R 2.13.1.
> >
> > right now, it seems that 'split()' is why I am losing my bet. (split
> > is an integral component of *apply() and by(), so I need split() to be
> > fast. its resulting list can then be fed, e.g., to mclapply().) I
> > made up an example to illustrate my ills:
> >
> > library(data.table)
> > N <- 1000
> > T <- N*10
> > d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
> > setkey(d, "key"); gc() ## force a garbage collection
> > cat("N=", N, ". Size of d=", object.size(d)/1024/1024, "MB\n")
> > print(system.time( s<-split(d, d$key) ))
> >
> > My ordered input data table (or data frame; doesn't make a difference)
> > is 114MB in size. it takes about a second to create. split() only
> > needs to reshape it. this simple operation takes almost 5 minutes on
> > my computer.
> >
> > with a data set that is larger, this explodes further.
> >
> > am I doing something wrong? is there an alternative to split()?
> >
> > sincerely,
> >
> > /iaw
> >
> > ----
> > Ivo Welch (ivo.we...@gmail.com)

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.