Re: [Bioc-devel] Trying to reduce the memory overhead when using mclapply

Martin Morgan Thu, 14 Nov 2013 07:48:43 -0800

On 11/14/2013 12:13 AM, Leonardo Collado Torres wrote:

Dear BioC developers,


I am trying to understand how to use mclapply() without blowing up the
memory usage and need some help.

My use case is splitting a large IRanges::DataFrame() into chunks, and
feeding these chunks to mclapply(). Let's say that I am using n cores and
that the operation I am doing uses K memory units.

That the data frame can be parallelized across rows implies that it can also be
vectorized. It would be useful to confirm that your complicated function is
actually fully vectorized, because the speed gains from vectorization can be
100-1000 fold compared to the speed gains (and added complexity) of parallel
evaluation.

A simple necessary condition might be that the function scales linearly or
better with the number of rows, especially as the number of rows gets large.
Even then there may be some obvious ways of speeding up the vectorized code,
e.g., hoisting constant expressions from inside for loops or lapply's.
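As a rough illustration of hoisting and vectorization, here is a minimal sketch. The function names (slow, hoisted, vectorized) and the toy standardization task are made up for this example and are not taken from the code under discussion; the point is only that all three forms return the same values while doing very different amounts of work.

```r
# Toy input; the size is arbitrary and chosen to keep the slow version quick.
set.seed(1)
x <- runif(2000)

# Loop version: mean(x) and sd(x) are recomputed on every iteration,
# so the cost grows quadratically with length(x).
slow <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- (x[i] - mean(x)) / sd(x)
  }
  out
}

# Hoisted version: the constant expressions are computed once, outside the loop.
hoisted <- function(x) {
  m <- mean(x)
  s <- sd(x)
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    out[i] <- (x[i] - m) / s
  }
  out
}

# Fully vectorized version: no explicit loop at all.
vectorized <- function(x) (x - mean(x)) / sd(x)

# All three should agree up to floating point tolerance.
stopifnot(isTRUE(all.equal(slow(x), hoisted(x))),
          isTRUE(all.equal(hoisted(x), vectorized(x))))
```

Timing each version with system.time() at a few input sizes (say n, 2n, 4n) is one simple way to check whether a function scales linearly or worse with the number of rows.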

There are some incomplete hints in the 'Efficient R' links at
http://bioconductor.org/help/course-materials/2013/UnderstandingRBioc2013/ and
the 'working with large data' section of
http://bioconductor.org/help/course-materials/2013/Akron-Oct-2013/StatisticalComputing.pdf.


Martin


I understand that the individual jobs in mclapply() cannot detect how the
others are doing and if they need to run gc(). While this, coupled with n * K,
could explain a higher memory usage, I am running into higher than
expected memory loads.

I have tried
1) pre-splitting the data into a list (one element per chunk),
2) assigning the elements of the list as elements of an environment and then
using mclapply() over a set of indexes,
3) saving each chunk on its own Rdata file, then using mclapply with a
function that loads the appropriate chunk and then performs the operation
of interest.
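
The three strategies listed above can be sketched roughly as follows. This is not the code from the linked example; the toy data frame, the work() function, the chunk names, and the file layout are all placeholders invented for illustration, and a plain data.frame stands in for the DataFrame.

```r
library(parallel)

# Toy data and a stand-in for the real per-chunk operation (both hypothetical).
set.seed(1)
df <- data.frame(id = seq_len(1000), value = rnorm(1000))
n_chunks <- 4L
chunk_id <- cut(seq_len(nrow(df)), n_chunks, labels = FALSE)
work <- function(chunk) sum(chunk$value)

# mclapply() relies on forking, which is not available on Windows.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# 1) Pre-split the data into a list, one element per chunk.
chunks <- split(df, chunk_id)
res1 <- mclapply(chunks, work, mc.cores = cores)

# 2) Store the chunks in an environment and iterate over indexes.
env <- list2env(setNames(chunks, paste0("chunk", seq_along(chunks))))
res2 <- mclapply(seq_along(chunks), function(i) {
  work(get(paste0("chunk", i), envir = env))
}, mc.cores = cores)

# 3) Save each chunk to its own Rdata file; each job loads only its own chunk.
chunk_dir <- tempfile("chunks")
dir.create(chunk_dir)
files <- file.path(chunk_dir, paste0("chunk", seq_along(chunks), ".Rdata"))
for (i in seq_along(chunks)) {
  chunk <- chunks[[i]]
  save(chunk, file = files[i])
}
rm(chunk)
res3 <- mclapply(files, function(f) {
  load(f)  # restores the object named 'chunk' into this function's frame
  work(chunk)
}, mc.cores = cores)
```

In a real run of strategy 3, any memory saving would presumably depend on removing the large objects (here df, chunks, and env) from the parent session and calling gc() before the mclapply() call, so that the forked workers do not inherit them.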

Strategy 3 performs best in terms of max memory usage, but I am afraid that
it is more error prone due to having to write to disk.

Do you have any other ideas/tips on how to reduce the memory load? In other
words, is there a strategy to reduce the number of copies as much as
possible when using mclapply()?


I have a full example (with data.frame instead of DataFrame) and code
comparing the three options described above at http://bit.ly/1ar71yA


Thank you,
Leonardo

Leonardo Collado Torres, PhD student
Department of Biostatistics
Johns Hopkins University
Bloomberg School of Public Health
Website: http://www.biostat.jhsph.edu/~lcollado/
Blog: http://bit.ly/FellBit


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
