I don't know if this is helpful for BiocParallel, but there's an extension for the foreach package that ensures reproducible RNG behavior for all parallel backends: https://cran.r-project.org/web/packages/doRNG/index.html
Perhaps some of the principles from that package can be re-used? On Mon, Jan 7, 2019 at 9:37 AM Aaron Lun < [email protected]> wrote: > > I hope for 1. to have a 'local socket' (i.e., not using ports) > implementation shortly. > > Yes, that would be helpful. > > > I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We > now have > > > >> library(BiocParallel) > >> set.seed(1); p = bpparam(); rnorm(1) > > [1] -0.6264538 > >> set.seed(1); p = bpparam(); rnorm(1) > > [1] -0.6264538 > > > > at the expensive of using the generator when the package is loaded. > > > >> set.seed(1); rnorm(1) > > [1] -0.6264538 > >> set.seed(1); library(BiocParallel); rnorm(1) > > [1] 0.1836433 > > > > Is that bad? It will be consistent across platforms. > > Hm. I guess the changed behaviour is… better, in the sense that the second > scenario (setting the seed before loading the package) is less likely in > real analysis code. > > Even so, there are probably some edge cases where this could cause issues, > e.g., when: > > set.seed(1) > MyPackage::SomeFun() > > … where MyPackage causes BiocParallel to be attached, which presumably > changes the seed. > > Having thought about it for a while, the fact that bpparam() changes the > random seed is only a secondary issue. The main issue is that it doesn’t > change the seed reproducibly. So I wouldn’t mind *as much* if repeated > calls of: > > set.seed(1) > bpparam() > rnorm(1) > > … gave the same result, even if it were different from just running > “set.seed(1); rnorm(1)”. (Mind you, I’d still mind a little, but it > wouldn’t be so bad.) The biggest problem with the current state of affairs > is that the first call gives different results to all subsequent calls, > which really interferes with debugging attempts. > > > This behavior > > > >> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1))) > > [1] 0.9624337 0.8925947 > >> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1))) > > [1] -0.5703597 0.1102093 > > > > seems wrong, but is consistent with mclapply > > > >> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1))) > > [1] -0.02704527 0.40721777 > >> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1))) > > [1] -0.8239765 1.2957928 > > > > The documented behavior is to us the RNGseed= argument to *Param, but I > think it could be made consistent (by default, obey the global random > number seed on workers) at least on a single machine (where the default > number of cores is constant). > > I’m less concerned with that behaviour, given it’s inherently hard to take > randomization code written for serial execution and make it give the same > results on multiple cores (as we discussed elsewhere). > > > I have not (yet?) changed the default behavior to SerialParam. I guess > the cost of SerialParam is from the dependent packages that need to be > loaded > > > >> system.time(suppressPackageStartupMessages(library(DelayedArray))) > > user system elapsed > > 3.068 0.082 3.150 > > Does calling "SerialParam()” cause DelayedArray to be attached? That seems > odd. > > > If fastMNN() makes several calls to bplapply(), it might make sense to > start the default cluster at the top of the function once > > > > if (!isup(bpparam())) { > > bpstart(bpparam()) > > on.exit(bpstop(bpparam())) > > } > > This is probably a good idea to do in general to all of my parallelized > functions, though I don’t know how much this will solve the time problem. > Perhaps I should just do it and see. > > -A > > > On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" < > [email protected] on behalf of > [email protected]> wrote: > > > > As we know, the default BiocParallel backends are currently set to > MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to > some extent because a new user running, say, bplapply() without additional > arguments or set-up would expect some kind of parallelization. However, > from a developer’s perspective, I would argue that it makes more sense to > use SerialParam() by default. > > > > 1. It avoids problems with MulticoreParam stalling (especially on > Macs) when the randomly chosen port is in already use. This used to be a > major problem, to the point that all my BiocParallel-using functions in > scran passed BPPARAM=SerialParam() by default. Setting SerialParam() as > package default would ensure BiocParallel functions run properly in the > first place; if the code stalls due to switching to MulticoreParam, then > it’s obvious where the problem lies (and how to fix it). > > > > 2. It avoids the alteration of the random seed when the > MulticoreParam instance is constructed for the first time. > > > > library(BiocParallel) # new R session > > set.seed(100) > > invisible(bplapply(1:5, identity)) > > rnorm(1) # 0.1315312 > > set.seed(100) > > invisible(bplapply(1:5, identity)) > > rnorm(1) # -0.5021924 > > > > This is because the first bplapply() call calls bpparam(), which > constructs a MulticoreParam() for the first time; this calls the PRNG to > choose a random port number. Ensuing random numbers are altered, as seen > above. To avoid this, I need to define the MulticoreParam() object prior to > set.seed(), which undermines the utility of a default-defined bpparam(). > > > > 3. Job dispatch via SnowParam() is quite slow, which potentially > makes Windows package builds run slower by default. A particularly bad > example is that of scran::fastMNN(), which has a few matrix multiplications > that use DelayedArray:%*%. The %*% is parallelized with the default > bpparam(), resulting in SNOW parallelization on Windows. This slowed down > fastMNN()’s examples from 4 seconds (unix) to ~100 seconds (windows). > Clearly, serial execution is the faster option here. A related problem is > MulticoreParam()’s tendency to copy the environment, which may result in > problems from inflated memory consumption. > > > > So, can we default to SerialParam() on all platforms? And by this I > mean the BiocParallel in-built default - I don’t want to have to instruct > all my users to put a “register(SerialParam())” at the start of their > analysis scripts. I feel that BiocParallel’s job is to provide downstream > code with the potential for parallelization. If end-users want actual > parallelization, they had better be prepared to specify an appropriate > scheme via *Param() objects. > > > > -A > > > > > > > > > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > [email protected] mailing list > > https://stat.ethz.ch/mailman/listinfo/bioc-devel > > > > _______________________________________________ > [email protected] mailing list > https://stat.ethz.ch/mailman/listinfo/bioc-devel > [[alternative HTML version deleted]] _______________________________________________ [email protected] mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
