I could be way off base here, but my concern about pre-splitting the data is that you end up with your data plus a second copy of it, something like a list where each element contains the portion of the data for that split. Good speed-wise, bad memory-wise. My hope with the technique I showed (again, I may not have accomplished it) was to have in memory, at any one time, only the original data and a copy of the particular elements being worked on. Of course this is not an issue if you have plenty of memory.
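As a rough illustration of that index-based idea (a minimal sketch, not necessarily the exact technique shown earlier in the thread; the names x, g, and f are hypothetical), you can split the indices rather than the data, so each worker only ever extracts its own subset of the original vector:

    library(parallel)   # mclapply(); the multicore package in older R

    x <- rnorm(1e6)                   # long data vector
    g <- sample(1:5000, 1e6, TRUE)    # many thousand different cases
    f <- mean                         # per-group function

    idx <- split(seq_along(x), g)     # small: one integer vector per group
    res <- mclapply(idx, function(i) f(x[i]), mc.cores = 2)

On Unix-alikes mclapply() forks the current R session, so x itself should be shared copy-on-write between the workers and only the x[i] subsets get copied.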
On Oct 10, 2011, at 12:19, Thomas Lumley <tlum...@uw.edu> wrote:

> On Tue, Oct 11, 2011 at 7:54 AM, ivo welch <ivo.we...@gmail.com> wrote:
>> hi josh---thx.  I had a different version of this, and discarded it
>> because I think it was very slow.  the reason is that on each
>> application, your version has to scan my (very long) data vector.  (I
>> have many thousand different cases, too.)  I presume that by() has one
>> scan through the vector that makes all splits.
>
> by.data.frame() is basically a wrapper for tapply(), and the key line
> in tapply() is
>
>     ans <- lapply(split(X, group), FUN, ...)
>
> which should be easy to adapt for mclapply.
>
> --
> Thomas Lumley
> Professor of Biostatistics
> University of Auckland
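For completeness, a sketch of the adaptation Thomas describes (the wrapper name parallel_by is hypothetical; X, group, and FUN follow the tapply() line quoted above), simply swapping lapply for mclapply over the pre-split data:

    library(parallel)

    parallel_by <- function(X, group, FUN, ..., mc.cores = 2) {
        mclapply(split(X, group), FUN, ..., mc.cores = mc.cores)
    }

    ## e.g. per-group means of a long vector:
    ## parallel_by(x, g, mean)

This is the pre-splitting version my memory concern above applies to, since split(X, group) builds the full second copy of the data before any work starts.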