Hi Martin,

Thank you for the links, they contain a lot of useful information!


I am trying to understand mclapply() better, mainly because of two use cases.

1) I have a large DataFrame, which I use because of its low memory footprint
and because the data compresses well with Rle's. I then do some matrix
operations (so yes, the code is vectorized) which can be done in chunks, so I
transform each chunk out of Rle world into a regular matrix. This is much
faster than using sqldf to store the full matrix and access it in chunks.
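
To make case 1 concrete, the pattern is roughly the following (a minimal
sketch with made-up data and a placeholder statistic, not the actual
derfinder code):

    library(parallel)
    library(IRanges)  # DataFrame() and Rle()

    ## Toy Rle-compressed DataFrame standing in for the real coverage data
    cov <- DataFrame(sample1 = Rle(rpois(1e6, 2)),
                     sample2 = Rle(rpois(1e6, 2)))

    ## Pre-split the rows into chunks
    chunkSize <- 1e5
    idxList <- split(seq_len(nrow(cov)), ceiling(seq_len(nrow(cov)) / chunkSize))

    ## For each chunk, leave Rle world and do the (vectorized) matrix work
    res <- mclapply(idxList, function(idx) {
        m <- do.call(cbind, lapply(cov[idx, ], as.numeric))  # regular matrix, this chunk only
        rowMeans(m)  # placeholder for the real matrix operation
    }, mc.cores = 4)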

2) Basically running coverage(readGAlignmentsFromBam(x)) over a BamFileList.
For reasons I do not understand, running this one chromosome at a time uses
at most 2.7 GB of RAM, but running it through mclapply() with 24 cores led to
431.938 GB of RAM used (way above 2.7 * 24). My best guess is the lack of
gc() communication between the child processes.
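
Concretely, case 2 boils down to something like this (a simplified sketch;
the BAM paths and the chromosome list are placeholders):

    library(parallel)
    library(Rsamtools)  # BamFileList(), readGAlignmentsFromBam()

    bams <- BamFileList(c("sample1.bam", "sample2.bam"))  # placeholder paths
    chrs <- paste0("chr", 1:22)

    ## One mclapply job per chromosome, each reading all BAM files for that chr
    covByChr <- mclapply(chrs, function(chr) {
        lapply(bams, function(bf) {
            len <- seqlengths(seqinfo(bf))[[chr]]
            param <- ScanBamParam(which = GRanges(chr, IRanges(1, len)))
            coverage(readGAlignmentsFromBam(bf, param = param))[[chr]]
        })
    }, mc.cores = 24)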


My goal is to keep wall clock time low while reducing the memory load if
possible.



I don't expect you to read the following, as it details the complete use
case, but there is more going on.





If you want to check this in more detail, as of version 0.0.31 of 'derfinder'
(which we plan to submit to BioC after more testing/tweaking, adding S4
classes, and adding a vignette):

1) I split the data (might change thanks to Ryan's input on this thread)
using preprocessCoverage()
<https://github.com/lcolladotor/derfinder/blob/master/R/preprocessCoverage.R>
and then feed the result to an mclapply wrapper, calculateStats()
<https://github.com/lcolladotor/derfinder/blob/master/R/calculateStats.R>.

2) I use loadCoverage()
<https://github.com/lcolladotor/derfinder/blob/master/R/loadCoverage.R>
to get the coverage from a BamFileList for a specific chromosome.
fullCoverage()
<https://github.com/lcolladotor/derfinder/blob/master/R/fullCoverage.R>
is an mclapply() wrapper for loadCoverage() which ideally gives you more
control over I/O (to avoid slowing the disk down too much for other users),
but currently leads to huge memory usage.
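
The wrapper idea itself is nothing fancy; stripped down it is just this (a
hypothetical stand-in for fullCoverage(), where loadOneChr represents
whatever loads a single chromosome, loadCoverage()-style):

    ## Hypothetical, simplified stand-in for the fullCoverage() idea: expose
    ## mc.cores so the caller can cap how many chromosomes are read from disk
    ## at once, trading wall clock time against I/O pressure and peak memory.
    loadAllChrs <- function(bams, chrs, loadOneChr,
                            mc.cores = getOption("mc.cores", 2L)) {
        res <- parallel::mclapply(chrs, function(chr) loadOneChr(bams, chr),
                                  mc.cores = mc.cores)
        names(res) <- chrs
        res
    }

    ## e.g. with 4 cores at most 4 chromosomes hit the disk simultaneously:
    ## fullCov <- loadAllChrs(bams, paste0("chr", 1:22), loadOneChr, mc.cores = 4)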

Full scripts are at https://github.com/lcolladotor/derfinderExample


Best,
Leo



On Thu, Nov 14, 2013 at 10:47 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:

> On 11/14/2013 12:13 AM, Leonardo Collado Torres wrote:
> > Dear BioC developers,
> >
> > I am trying to understand how to use mclapply() without blowing up the
> > memory usage and need some help.
> >
> > My use case is splitting a large IRanges::DataFrame() into chunks, and
> > feeding these chunks to mclapply(). Let's say that I am using n cores and
> > that the operation I am doing uses K memory units.
>
> That the data frame can be parallelized across rows implies that it can
> also be
> vectorized. It would be useful to confirm that your complicated function is
> actually fully vectorized, because the speed gains from vectorization can
> be
> 100-1000 fold compared to the speed gains (and added complexity) of
> parallel
> evaluation.
>
> A simple necessary condition might be that the function scales linearly or
> better with the number of rows, especially as the number of rows gets
> large.
> Even then there may be some obvious ways of speeding up the vectorized
> code,
> e.g., hoisting constant expressions from inside for loops or lapply's.
>
> There are some incomplete hints in the 'Efficient R' links at
> http://bioconductor.org/help/course-materials/2013/UnderstandingRBioc2013/ and
> the 'working with large data' section of
>
> http://bioconductor.org/help/course-materials/2013/Akron-Oct-2013/StatisticalComputing.pdf
> .
>
> Martin
>
> >
> > I understand that the individual jobs in mclapply() cannot detect how the
> > others are doing and whether they need to run gc(). While this, coupled
> > with the n * K usage, could explain a higher memory usage, I am running
> > into higher than expected memory loads.
> >
> > I have tried
> > 1) pre-splitting the data into a list (one element per chunk),
> > 2) assigning the elements of the list as elements of an environment and
> the
> > using mclapply() over a set of indexes,
> > 3) saving each chunk in its own Rdata file, then using mclapply with a
> > function that loads the appropriate chunk and then performs the operation
> > of interest.
> >
> > Strategy 3 performs best in terms of max memory usage, but I am afraid
> that
> > it is more error prone due to having to write to disk.
> >
> > Do you have any other ideas/tips on how to reduce the memory load? In
> other
> > words, is there a strategy to reduce the number of copies as much as
> > possible when using mclapply()?
> >
> >
> > I have a full example (with data.frame instead of DataFrame) and code
> > comparing the three options described above at http://bit.ly/1ar71yA
> >
> >
> > Thank you,
> > Leonardo
> >
> > Leonardo Collado Torres, PhD student
> > Department of Biostatistics
> > Johns Hopkins University
> > Bloomberg School of Public Health
> > Website: http://www.biostat.jhsph.edu/~lcollado/
> > Blog: http://bit.ly/FellBit
> >
> >
>
>
> --
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>
