On 11/14/2013 12:13 AM, Leonardo Collado Torres wrote:
Dear BioC developers,
I am trying to understand how to use mclapply() without blowing up the
memory usage and need some help.
My use case is splitting a large IRanges::DataFrame() into chunks, and
feeding these chunks to mclapply(). Let's say that I am using n cores and
that the operation I am doing uses K memory units.
That the data frame can be parallelized across rows implies that it can also be
vectorized. It would be useful to confirm that your complicated function is
actually fully vectorized, because the speed gains from vectorization can be
100-1000 fold compared to the speed gains (and added complexity) of parallel
evaluation.
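To make the comparison concrete, here is a minimal sketch with made-up data. The column names 'a' and 'b' and the functions by_row() and vectorized() are hypothetical stand-ins, not taken from the code under discussion, and the actual speed-up will depend on the real function.

```r
## Made-up data; 'a' and 'b' are hypothetical column names.
set.seed(1)
df <- data.frame(a = runif(2e4), b = runif(2e4))

## Row-by-row: one R-level function call per row.
by_row <- function(df) {
    vapply(seq_len(nrow(df)),
           function(i) df$a[i] * 2 + df$b[i],
           numeric(1))
}

## Vectorized: a single call operating on whole columns.
vectorized <- function(df) df$a * 2 + df$b

## Both give the same answer; only the timing differs.
stopifnot(isTRUE(all.equal(by_row(df), vectorized(df))))
system.time(by_row(df))
system.time(vectorized(df))
```

If the two timings differ by orders of magnitude on realistic input, vectorizing is probably worth doing before reaching for mclapply().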
A simple necessary condition might be that the function scales linearly or
better with the number of rows, especially as the number of rows gets large.
Even then there may be some obvious ways of speeding up the vectorized code,
e.g., hoisting constant expressions out of for loops or lapply() calls.
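A rough way to check both points is to time the function at increasing input sizes. The sketch below uses a toy function; unhoisted() and hoisted() are hypothetical names, and mean(x) plays the role of the constant expression that does not need to be recomputed on every iteration.

```r
set.seed(1)

## Loop-invariant expression mean(x) is evaluated once per element,
## so the cost grows roughly quadratically with length(x).
unhoisted <- function(x) sapply(x, function(xi) xi - mean(x))

## Same result with the constant computed once; the cost grows linearly.
hoisted <- function(x) {
    m <- mean(x)
    x - m
}

## Time both at doubling sizes and compare how elapsed time grows.
sizes <- c(1000, 2000, 4000, 8000)
timing <- t(sapply(sizes, function(n) {
    x <- rnorm(n)
    c(rows = n,
      unhoisted = system.time(unhoisted(x))[["elapsed"]],
      hoisted = system.time(hoisted(x))[["elapsed"]])
}))
timing

## The two versions agree on the result.
stopifnot(isTRUE(all.equal(unhoisted(1:10), hoisted(1:10))))
```

If elapsed time more than doubles each time the number of rows doubles, the function is scaling worse than linearly, and something inside it is likely being recomputed or copied.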
There are some incomplete hints in the 'Efficient R' links at
http://bioconductor.org/help/course-materials/2013/UnderstandingRBioc2013/ and
the 'working with large data' section of
http://bioconductor.org/help/course-materials/2013/Akron-Oct-2013/StatisticalComputing.pdf.
Martin
I understand that the individual jobs in mclapply() cannot detect how the
others are doing or whether they need to run gc(). While this, coupled with
the n * K baseline, could explain a higher memory usage, I am running into
higher than expected memory loads.
I have tried
1) pre-splitting the data into a list (one element per chunk),
2) assigning the elements of the list as elements of an environment and then
using mclapply() over a set of indexes,
3) saving each chunk on its own Rdata file, then using mclapply with a
function that loads the appropriate chunk and then performs the operation
of interest.
Strategy 3 performs best in terms of max memory usage, but I am afraid that
it is more error prone due to having to write to disk.
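For reference, the three strategies can be sketched roughly as follows. This is not the code from the linked example: the data, the chunk count, the work() function and the file layout are all made up for illustration, and the sketch only shows how each approach is wired up, not how much memory each one uses.

```r
library(parallel)

set.seed(1)
df <- data.frame(x = rnorm(1e5), y = rnorm(1e5))   # made-up data
n_chunks <- 4
## Forking is not available on Windows, so fall back to one core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
work <- function(chunk) sum(chunk$x * chunk$y)     # hypothetical operation

## Assign each row to one of n_chunks contiguous chunks.
idx <- cut(seq_len(nrow(df)), n_chunks, labels = FALSE)

## 1) Pre-split into a list, one element per chunk.
chunks <- split(df, idx)
names(chunks) <- paste0("chunk", seq_along(chunks))
res1 <- mclapply(chunks, work, mc.cores = cores)

## 2) Store the chunks in an environment and iterate over their names.
env <- list2env(chunks)
res2 <- mclapply(names(chunks), function(nm) work(env[[nm]]),
                 mc.cores = cores)

## 3) Save each chunk to its own file; each job loads only its chunk.
chunk_dir <- tempfile("chunks")
dir.create(chunk_dir)
files <- vapply(names(chunks), function(nm) {
    f <- file.path(chunk_dir, paste0(nm, ".Rdata"))
    chunk <- chunks[[nm]]
    save(chunk, file = f)
    f
}, character(1))
res3 <- mclapply(files, function(f) {
    e <- new.env()
    load(f, envir = e)   # restores 'chunk' into e
    work(e$chunk)
}, mc.cores = cores)

## All three strategies should give the same per-chunk results.
stopifnot(isTRUE(all.equal(unname(unlist(res1)), unname(unlist(res2)))))
stopifnot(isTRUE(all.equal(unname(unlist(res1)), unname(unlist(res3)))))
```

In a real run of strategy 3, the parent session would typically remove df and chunks before calling mclapply(), so that the forked jobs do not inherit them.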
Do you have any other ideas/tips on how to reduce the memory load? In other
words, is there a strategy to reduce the number of copies as much as
possible when using mclapply()?
I have a full example (with data.frame instead of DataFrame) and code
comparing the three options described above at http://bit.ly/1ar71yA
Thank you,
Leonardo
Leonardo Collado Torres, PhD student
Department of Biostatistics
Johns Hopkins University
Bloomberg School of Public Health
Website: http://www.biostat.jhsph.edu/~lcollado/
Blog: http://bit.ly/FellBit
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793