Hello Ryan,

Thank you for looking at the example and even forking the gist. I have updated the example with your approach and measured that it uses 6.794G of RAM when starting from scratch with 20 cores, which beats the other three approaches under that scenario.
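For reference, here is a minimal sketch of that index-splitting approach (the simulated 'data' matrix and names like 'ncores' below are placeholders of mine, not the exact code in the gist):

    library(parallel)

    ncores <- 20
    ## Stand-in for the real data (a large numeric matrix in this sketch;
    ## the gist uses a data.frame).
    data <- matrix(rnorm(1e6), nrow = 1e4)

    ## Split the row indices (not the data itself) so that the forked
    ## children read the shared 'data' object without copying it.
    idx <- splitIndices(nrow(data), ncores)
    res <- mclapply(idx, function(i) rowMeans(data[i, ]), mc.cores = ncores)

    ## Concatenate the per-chunk results back into a single vector.
    means <- do.call(c, unname(res))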
Also note that just generating the 'data' object uses 558.938M of RAM, so at least none of the approaches reach the ~550 MB * 20 level.

So, without resorting to I/O, the approach you suggested seems like the best option. It is still beaten by saving the data, restarting R, and then running mclapply(), but that has the drawback of having to restart R.

Thank you,
Leo

On Thu, Nov 14, 2013 at 4:09 AM, Ryan <r...@thompsonclan.org> wrote:
> To minimize the additional memory used by mclapply(), remember that
> mclapply() works by forking processes. The advantage of this is that,
> as long as an object is not modified in either the parent or a child,
> they share the memory for that object. This effectively means that a
> child process *only* uses a significant amount of memory when it
> modifies existing objects (triggering creation of a copy) or creates
> new objects.
>
> In your case, there's no point in splitting the data (which results in
> creating copies). You only have to split the indices using
> parallel::splitIndices. I've tried to incorporate this into your gist:
> https://gist.github.com/DarwinAwardWinner/7463652
>
> The key line is:
>
> res4 <- mclapply(splitIndices(nrow(data), opt$mcores), function(i)
>     rowMeans(data[i,]), mc.cores=opt$mcores)
>
> Also, for concatenating the results, you can use
> do.call(c, unname(res4)).
>
> On Thu Nov 14 00:13:41 2013, Leonardo Collado Torres wrote:
> > Dear BioC developers,
> >
> > I am trying to understand how to use mclapply() without blowing up
> > the memory usage, and I need some help.
> >
> > My use case is splitting a large IRanges::DataFrame() into chunks and
> > feeding these chunks to mclapply(). Let's say that I am using n cores
> > and that the operation I am doing uses K memory units.
> >
> > I understand that the individual jobs in mclapply() cannot detect how
> > the others are doing and whether they need to run gc(). While this,
> > together with the n * K total, could explain a higher memory usage, I
> > am running into higher-than-expected memory loads.
> >
> > I have tried
> > 1) pre-splitting the data into a list (one element per chunk),
> > 2) assigning the elements of the list as elements of an environment
> >    and then using mclapply() over a set of indices,
> > 3) saving each chunk in its own Rdata file, then using mclapply()
> >    with a function that loads the appropriate chunk and performs the
> >    operation of interest.
> >
> > Strategy 3 performs best in terms of maximum memory usage, but I am
> > afraid that it is more error prone because it has to write to disk.
> >
> > Do you have any other ideas/tips on how to reduce the memory load? In
> > other words, is there a strategy to reduce the number of copies as
> > much as possible when using mclapply()?
> >
> > I have a full example (with data.frame instead of DataFrame) and code
> > comparing the three options described above at http://bit.ly/1ar71yA
> >
> > Thank you,
> > Leonardo
> >
> > Leonardo Collado Torres, PhD student
> > Department of Biostatistics
> > Johns Hopkins University
> > Bloomberg School of Public Health
> > Website: http://www.biostat.jhsph.edu/~lcollado/
> > Blog: http://bit.ly/FellBit

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel