The main problem I’ve described refers to changes in the random seed due to the MulticoreParam() constructor, prior to dispatch to workers. For the related-but-separate problem of obtaining consistent random results within each worker, we’ve been discussing the possible solutions on another Bioc-devel thread (https://stat.ethz.ch/pipermail/bioc-devel/2019-January/014498.html <https://stat.ethz.ch/pipermail/bioc-devel/2019-January/014498.html>).
-A > On 7 Jan 2019, at 15:03, Ryan Thompson <[email protected]> wrote: > > I don't know if this is helpful for BiocParallel, but there's an extension > for the foreach package that ensures reproducible RNG behavior for all > parallel backends: https://cran.r-project.org/web/packages/doRNG/index.html > <https://cran.r-project.org/web/packages/doRNG/index.html> > > Perhaps some of the principles from that package can be re-used? > > On Mon, Jan 7, 2019 at 9:37 AM Aaron Lun > <[email protected] > <mailto:[email protected]>> wrote: > > I hope for 1. to have a 'local socket' (i.e., not using ports) > > implementation shortly. > > Yes, that would be helpful. > > > I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We now > > have > > > >> library(BiocParallel) > >> set.seed(1); p = bpparam(); rnorm(1) > > [1] -0.6264538 > >> set.seed(1); p = bpparam(); rnorm(1) > > [1] -0.6264538 > > > > at the expensive of using the generator when the package is loaded. > > > >> set.seed(1); rnorm(1) > > [1] -0.6264538 > >> set.seed(1); library(BiocParallel); rnorm(1) > > [1] 0.1836433 > > > > Is that bad? It will be consistent across platforms. > > Hm. I guess the changed behaviour is… better, in the sense that the second > scenario (setting the seed before loading the package) is less likely in real > analysis code. > > Even so, there are probably some edge cases where this could cause issues, > e.g., when: > > set.seed(1) > MyPackage::SomeFun() > > … where MyPackage causes BiocParallel to be attached, which presumably > changes the seed. > > Having thought about it for a while, the fact that bpparam() changes the > random seed is only a secondary issue. The main issue is that it doesn’t > change the seed reproducibly. So I wouldn’t mind *as much* if repeated calls > of: > > set.seed(1) > bpparam() > rnorm(1) > > … gave the same result, even if it were different from just running > “set.seed(1); rnorm(1)”. (Mind you, I’d still mind a little, but it wouldn’t > be so bad.) The biggest problem with the current state of affairs is that the > first call gives different results to all subsequent calls, which really > interferes with debugging attempts. > > > This behavior > > > >> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1))) > > [1] 0.9624337 0.8925947 > >> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1))) > > [1] -0.5703597 0.1102093 > > > > seems wrong, but is consistent with mclapply > > > >> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1))) > > [1] -0.02704527 0.40721777 > >> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1))) > > [1] -0.8239765 1.2957928 > > > > The documented behavior is to us the RNGseed= argument to *Param, but I > > think it could be made consistent (by default, obey the global random > > number seed on workers) at least on a single machine (where the default > > number of cores is constant). > > I’m less concerned with that behaviour, given it’s inherently hard to take > randomization code written for serial execution and make it give the same > results on multiple cores (as we discussed elsewhere). > > > I have not (yet?) changed the default behavior to SerialParam. I guess the > > cost of SerialParam is from the dependent packages that need to be loaded > > > >> system.time(suppressPackageStartupMessages(library(DelayedArray))) > > user system elapsed > > 3.068 0.082 3.150 > > Does calling "SerialParam()” cause DelayedArray to be attached? That seems > odd. > > > If fastMNN() makes several calls to bplapply(), it might make sense to > > start the default cluster at the top of the function once > > > > if (!isup(bpparam())) { > > bpstart(bpparam()) > > on.exit(bpstop(bpparam())) > > } > > This is probably a good idea to do in general to all of my parallelized > functions, though I don’t know how much this will solve the time problem. > Perhaps I should just do it and see. > > -A > > > On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" > > <[email protected] <mailto:[email protected]> > > on behalf of [email protected] > > <mailto:[email protected]>> wrote: > > > > As we know, the default BiocParallel backends are currently set to > > MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to > > some extent because a new user running, say, bplapply() without additional > > arguments or set-up would expect some kind of parallelization. However, > > from a developer’s perspective, I would argue that it makes more sense to > > use SerialParam() by default. > > > > 1. It avoids problems with MulticoreParam stalling (especially on Macs) > > when the randomly chosen port is in already use. This used to be a major > > problem, to the point that all my BiocParallel-using functions in scran > > passed BPPARAM=SerialParam() by default. Setting SerialParam() as package > > default would ensure BiocParallel functions run properly in the first > > place; if the code stalls due to switching to MulticoreParam, then it’s > > obvious where the problem lies (and how to fix it). > > > > 2. It avoids the alteration of the random seed when the MulticoreParam > > instance is constructed for the first time. > > > > library(BiocParallel) # new R session > > set.seed(100) > > invisible(bplapply(1:5, identity)) > > rnorm(1) # 0.1315312 > > set.seed(100) > > invisible(bplapply(1:5, identity)) > > rnorm(1) # -0.5021924 > > > > This is because the first bplapply() call calls bpparam(), which > > constructs a MulticoreParam() for the first time; this calls the PRNG to > > choose a random port number. Ensuing random numbers are altered, as seen > > above. To avoid this, I need to define the MulticoreParam() object prior to > > set.seed(), which undermines the utility of a default-defined bpparam(). > > > > 3. Job dispatch via SnowParam() is quite slow, which potentially makes > > Windows package builds run slower by default. A particularly bad example is > > that of scran::fastMNN(), which has a few matrix multiplications that use > > DelayedArray:%*%. The %*% is parallelized with the default bpparam(), > > resulting in SNOW parallelization on Windows. This slowed down fastMNN()’s > > examples from 4 seconds (unix) to ~100 seconds (windows). Clearly, serial > > execution is the faster option here. A related problem is > > MulticoreParam()’s tendency to copy the environment, which may result in > > problems from inflated memory consumption. > > > > So, can we default to SerialParam() on all platforms? And by this I mean > > the BiocParallel in-built default - I don’t want to have to instruct all my > > users to put a “register(SerialParam())” at the start of their analysis > > scripts. I feel that BiocParallel’s job is to provide downstream code with > > the potential for parallelization. If end-users want actual > > parallelization, they had better be prepared to specify an appropriate > > scheme via *Param() objects. > > > > -A > > > > > > > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > [email protected] <mailto:[email protected]> mailing list > > https://stat.ethz.ch/mailman/listinfo/bioc-devel > > <https://stat.ethz.ch/mailman/listinfo/bioc-devel> > > > > _______________________________________________ > [email protected] <mailto:[email protected]> mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > <https://stat.ethz.ch/mailman/listinfo/bioc-devel> [[alternative HTML version deleted]] _______________________________________________ [email protected] mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
