Hi,

I ran my example code with SerialParam(), which showed a negligible 4% memory increase between R 3.1.x and R 3.2.x. This 4% could well fluctuate a little and might not be significantly different from 0 if I ran the test more times.
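To get a feel for how much a single measurement can fluctuate, something like the following could be run under each R version. It is only a rough sketch: `peak_mb()` and `workload()` are hypothetical names (the workload is a stand-in, not my analysis code), and it relies on R's own gc() accounting, which is not the same as what SGE reports for a job.

```r
# Peak memory (MB) recorded by R's garbage collector while running `fun`.
# This is R's own accounting for the current process (baseline included),
# not the figure SGE reports for the job.
peak_mb <- function(fun) {
    invisible(gc(reset = TRUE))                      # reset the "max used" statistics
    fun()
    g <- gc()
    mb_col <- which(colnames(g) == "max used") + 1L  # "(Mb)" column right after "max used"
    sum(g[, mb_col])                                 # Ncells + Vcells, in MB
}

# Hypothetical stand-in for the real workload
workload <- function() {
    x <- matrix(rnorm(2e6), ncol = 10)
    invisible(colSums(x))
}

# Repeat the measurement to see the run-to-run variation
peaks <- vapply(seq_len(5), function(i) peak_mb(workload), numeric(1))
cv_pct <- 100 * sd(peaks) / mean(peaks)
cat(sprintf("mean peak: %.1f MB, CV: %.2f%%\n", mean(peaks), cv_pct))
```

If the coefficient of variation across repeats is of the same order as the difference between R versions, that difference is probably just noise.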
I also added a second example using code based on my analysis script. With SerialParam(), the memory change between the R versions mentioned above is 13%, but with SnowParam() using 10 cores it's 82%. That's still far from the > 150% increase (2.5-fold change) I'm seeing with the real data. I initially thought that these observations ruled out everything except SnowParam(). However, maybe the initial 13% memory increase multiplied by 10 (well, less than linearly) is what I'm seeing with 10 cores (82% increase).

The updated information is available at http://lcolladotor.github.io/SnowParam-memory/

As for Vincent's suggestion of an AMI and EC2, I don't have experience with them. I'm not sure I'll be able to look into them and create a reproducible environment.

Cheers,
Leo

On Fri, Jul 10, 2015 at 7:12 AM, Vincent Carey <st...@channing.harvard.edu> wrote:
> I have had (potentially transient and environment-related) problems with
> bplapply in gQTLstats. I substituted the foreach abstractions and the code
> worked. I still have difficulty seeing how to diagnose the trouble I ran
> into.
>
> I'd suggest that you code so that you can easily substitute parallel- or
> foreach- or BatchJobs-based cluster control. This can help crudely isolate
> the source of trouble.
>
> It would be very nice to have a way of measuring resource usage in cluster
> settings, both for diagnosis and strategy selection. For jobs that succeed,
> BatchJobs records memory used in its registry database, based on gc(). I
> would hope that there are tools that could be used to help one figure out
> how to factor a task so that it is feasible given some view of environment
> constraints.
>
> It might be useful for you to build an AMI and then a cluster that allows
> replication of the condition you are seeing on EC2. This could help with
> diagnosis and might be a basis for defining better instrumentation tools
> for both diagnosis and planning.
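A minimal sketch of the back-end substitution idea suggested above. The wrapper name `run_chunks()`, its arguments and the toy data are assumptions made for illustration; the `bioc_snow` branch is only reached if that back-end is requested and BiocParallel is installed.

```r
library(parallel)  # ships with R

# Hypothetical wrapper: same inputs, interchangeable back-ends
run_chunks <- function(X, FUN,
                       backend = c("serial", "snow", "multicore", "bioc_snow"),
                       workers = 2L) {
    backend <- match.arg(backend)
    switch(backend,
        serial = lapply(X, FUN),
        snow = {
            cl <- makeCluster(workers)            # PSOCK cluster by default
            on.exit(stopCluster(cl), add = TRUE)  # always shut the workers down
            parLapply(cl, X, FUN)
        },
        multicore = mclapply(X, FUN, mc.cores = workers),  # forking; not on Windows
        bioc_snow = BiocParallel::bplapply(
            X, FUN, BPPARAM = BiocParallel::SnowParam(workers = workers)
        )
    )
}

# Toy data: a list of data frames with random values
set.seed(1)
chunks <- replicate(4, data.frame(x = rnorm(100), y = rnorm(100)),
                    simplify = FALSE)
f <- function(df) colMeans(df)

res_serial <- run_chunks(chunks, f, backend = "serial")
res_snow   <- run_chunks(chunks, f, backend = "snow", workers = 2L)

# Both back-ends should give the same answer
stopifnot(isTRUE(all.equal(res_serial, res_snow)))
```

With the back-end reduced to a single argument, the same job can be resubmitted with only that argument changed, and the memory reported by the cluster can be compared across back-ends.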
>
> On Fri, Jul 10, 2015 at 12:23 AM, Leonardo Collado Torres <lcoll...@jhu.edu>
> wrote:
>>
>> Hi,
>>
>> I have a script that at some point generates a list of DataFrame
>> objects which are rather large matrices. I then feed this list to
>> BiocParallel::bplapply() and process them.
>>
>> Previously, I noticed that in our SGE-managed cluster using
>> MulticoreParam() led to 5 to 8 times higher memory usage, as I posted
>> in https://support.bioconductor.org/p/62551/#62877. Martin posted in
>> https://support.bioconductor.org/p/62551/#62880 that "Probably the
>> tools used to assess memory usage are misleading you." This could be
>> true, but they are the tools that determine memory usage for all jobs
>> in the cluster, meaning that if my memory usage blows up according to
>> these tools, my jobs get killed.
>>
>> That was with R 3.1.x and in particular running
>>
>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
>> with
>>
>> $ sh step1-fullCoverage.sh brainspan
>>
>> which at the time (Nov 4th, 2014) used 173.5 GB of RAM with 10 cores.
>> I recently tried to reproduce this (to check changes in run time given
>> rtracklayer's improvements with BigWig files) using R 3.2.x and the
>> memory went up to 450 GB before the job got killed given the maximum
>> memory I specified for the job. The same is true using R 3.2.0.
>>
>> Between R 3.1.x and 3.2.0, `derfinder` is nearly identical (just one
>> bug fix is different, for other code not used in this script). I know
>> that BiocParallel changed quite a bit between those versions, and in
>> particular SnowParam(). So that's why my prime suspect is
>> BiocParallel.
>>
>> I made a smaller reproducible example which you can view at
>> http://lcolladotor.github.io/SnowParam-memory/. This example uses a
>> list of data frames with random data, and also uses 10 cores.
>> You can see there that in R versions 3.1.x, 3.2.0 and 3.2.x,
>> MulticoreParam() does use more memory than SnowParam(), as reported
>> by SGE. Beyond the actual session info differences due to changes in
>> BiocParallel's implementation, I noticed that the cluster type changed
>> from PSOCK to SOCK. I don't know whether this could explain the memory
>> increase.
>>
>> The example doesn't generate the huge fold change between R 3.1.x and
>> the other two versions (still 1.27x > 1x) that I see with my analysis
>> script, so in that sense it's not the best example for the problem I'm
>> observing. My tests with
>>
>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.sh
>> were between June 23rd and 28th, so maybe some recent changes in
>> BiocParallel addressed this issue.
>>
>>
>> I'm not sure how to proceed now. One idea is to make another example
>> with the same type of objects and operations I use in my analysis
>> script.
>>
>> A second one is to run my analysis script with SerialParam() on the
>> different R versions to check if they use different amounts of memory,
>> which would suggest that the memory issue is not caused by
>> SnowParam(). For example, maybe changes in rtracklayer are the ones
>> driving the huge memory changes I'm seeing in my analysis scripts.
>>
>> However, I don't really suspect rtracklayer given the memory load
>> reported that I checked manually a couple of times with "qmem". I
>> believe that the memory blows up at
>>
>> https://github.com/leekgroup/derSoftware/blob/gh-pages/step1-fullCoverage.R#L124
>> which in turn uses derfinder::filterData().
>> This function imports:
>>
>> '[', '[<-', '[[', colnames, 'colnames<-', lapply methods from IRanges
>> Rle, DataFrame from S4Vectors
>> Reduce method from S4Vectors
>>
>>
>> https://github.com/lcolladotor/derfinder/blob/master/R/filterData.R#L49-L51
>>
>>
>> Best,
>> Leo
>>
>>
>> History of analysis scripts doesn't reveal any other leads
>>
>> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.sh
>>
>> https://github.com/leekgroup/derSoftware/commits/gh-pages/step1-fullCoverage.R
>>
>> _______________________________________________
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel