Hello Ryan,

Thank you for looking at the example and even forking the gist. I have
updated the example with your approach and also calculated that it uses
6.794 GB of RAM when starting from scratch with 20 cores, thus beating the
other three approaches under that scenario.

Also note that just generating the 'data' object uses 558.938 MB of RAM, so
at least none of the approaches comes close to the ~550 MB * 20 level.


So, without resorting to I/O, the approach you suggested seems like the best
option. It is still beaten by saving the data, restarting R, and then running
mclapply(), but that route has the drawback of having to restart R.
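
For clarity, the I/O route I mean is roughly the following (just a sketch;
the file name is invented and the real code is in the gist):

    ## Session 1: build 'data' once and write it out.
    save(data, file = "data.Rdata")
    quit(save = "no")

    ## Session 2: start a fresh R, reload, and only then fork, so the
    ## children inherit little besides 'data' itself.
    library(parallel)
    load("data.Rdata")
    ## ... followed by the same splitIndices()/mclapply() call as in your gist.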

Thank you,
Leo

On Thu, Nov 14, 2013 at 4:09 AM, Ryan <r...@thompsonclan.org> wrote:

> To minimize the additional memory used by mclapply, remember that
> mclapply works by forking processes. The advantage of this is that as
> long as an object is not modified in either the parent or the child,
> they share the memory for that object. Effectively, a child process
> *only* uses a significant amount of memory when it modifies existing
> objects (triggering creation of a copy) or creates a new object.
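
As a toy illustration of that copy-on-write point (purely a sketch with
made-up sizes, not code from either gist):

    library(parallel)
    data <- matrix(rnorm(2e7), ncol = 20)   # large object in the parent

    # Children that only read 'data' keep sharing the parent's pages:
    reads  <- mclapply(1:4, function(j) sum(data[, j]), mc.cores = 4)

    # Children that modify 'data' each end up with a private copy of it:
    writes <- mclapply(1:4, function(j) { data[, j] <- 0; sum(data) },
                       mc.cores = 4)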
>
> In your case, there's no point in splitting the data (which results in
> creating copies). You only have to split the indices using
> parallel::splitIndices. I've tried to incorporate this into your gist:
> https://gist.github.com/DarwinAwardWinner/7463652
>
> The key line is:
>
>     res4 <- mclapply(splitIndices(nrow(data), opt$mcores),
>                      function(i) rowMeans(data[i, ]),
>                      mc.cores = opt$mcores)
>
> Also, for concatenating the results, you can use
> "do.call(c, unname(res4))".
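
Putting the whole suggestion together as a self-contained toy run (the data
size and core count below are made up; 'opt$mcores' is replaced by a plain
number):

    library(parallel)
    data <- data.frame(matrix(rnorm(1e6), ncol = 10))
    mcores <- 4

    res4 <- mclapply(splitIndices(nrow(data), mcores),
                     function(i) rowMeans(data[i, ]),
                     mc.cores = mcores)
    res4 <- do.call(c, unname(res4))

    # Sanity check against the serial result:
    stopifnot(isTRUE(all.equal(res4, rowMeans(data))))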
>
> On Thu Nov 14 00:13:41 2013, Leonardo Collado Torres wrote:
> > Dear BioC developers,
> >
> > I am trying to understand how to use mclapply() without blowing up the
> > memory usage and need some help.
> >
> > My use case is splitting a large IRanges::DataFrame() into chunks, and
> > feeding these chunks to mclapply(). Let's say that I am using n cores and
> > that the operation I am doing uses K memory units.
> >
> > I understand that the individual jobs in mclapply() cannot detect how the
> > others are doing or whether they need to run gc(). While this, coupled with
> > the n * K usage, could explain a higher memory footprint, I am running into
> > higher than expected memory loads.
> >
> > I have tried:
> > 1) pre-splitting the data into a list (one element per chunk),
> > 2) assigning the elements of the list as elements of an environment and
> > then using mclapply() over a set of indexes,
> > 3) saving each chunk to its own Rdata file, then using mclapply() with a
> > function that loads the appropriate chunk and then performs the operation
> > of interest.
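
Strategy 3, roughly sketched (the chunking and file names below are
invented; the full version is in the linked example):

    library(parallel)
    chunks <- splitIndices(nrow(data), 20)

    ## Write each chunk to its own Rdata file.
    for (k in seq_along(chunks)) {
        chunk <- data[chunks[[k]], ]
        save(chunk, file = paste0("chunk-", k, ".Rdata"))
    }
    rm(chunk)

    ## Each worker loads only the chunk it needs and works on that.
    res3 <- mclapply(seq_along(chunks), function(k) {
        load(paste0("chunk-", k, ".Rdata"))   # creates 'chunk' in the worker
        rowMeans(chunk)
    }, mc.cores = 20)
    res3 <- do.call(c, unname(res3))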
> >
> > Strategy 3 performs best in terms of max memory usage, but I am afraid
> > that it is more error prone due to having to write to disk.
> >
> > Do you have any other ideas/tips on how to reduce the memory load? In
> > other words, is there a strategy to reduce the number of copies as much
> > as possible when using mclapply()?
> >
> >
> > I have a full example (with data.frame instead of DataFrame) and code
> > comparing the three options described above at http://bit.ly/1ar71yA
> >
> >
> > Thank you,
> > Leonardo
> >
> > Leonardo Collado Torres, PhD student
> > Department of Biostatistics
> > Johns Hopkins University
> > Bloomberg School of Public Health
> > Website: http://www.biostat.jhsph.edu/~lcollado/
> > Blog: http://bit.ly/FellBit
> >

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
