The main problem I’ve described refers to changes in the random seed due to the 
MulticoreParam() constructor, prior to dispatch to workers. For the 
related-but-separate problem of obtaining consistent random results within each 
worker, we’ve been discussing the possible solutions on another Bioc-devel 
thread (https://stat.ethz.ch/pipermail/bioc-devel/2019-January/014498.html 
<https://stat.ethz.ch/pipermail/bioc-devel/2019-January/014498.html>). 

-A

> On 7 Jan 2019, at 15:03, Ryan Thompson <[email protected]> wrote:
> 
> I don't know if this is helpful for BiocParallel, but there's an extension 
> for the foreach package that ensures reproducible RNG behavior for all 
> parallel backends: https://cran.r-project.org/web/packages/doRNG/index.html 
> <https://cran.r-project.org/web/packages/doRNG/index.html>
> 
> Perhaps some of the principles from that package can be re-used?
> 
> On Mon, Jan 7, 2019 at 9:37 AM Aaron Lun 
> <[email protected] 
> <mailto:[email protected]>> wrote:
> > I hope for 1. to have a 'local socket' (i.e., not using ports) 
> > implementation shortly.
> 
> Yes, that would be helpful.
> 
> > I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We now 
> > have
> > 
> >> library(BiocParallel)
> >> set.seed(1); p = bpparam(); rnorm(1)
> > [1] -0.6264538
> >> set.seed(1); p = bpparam(); rnorm(1)
> > [1] -0.6264538
> > 
> > at the expensive of using the generator when the package is loaded.
> > 
> >> set.seed(1); rnorm(1)
> > [1] -0.6264538
> >> set.seed(1); library(BiocParallel); rnorm(1)
> > [1] 0.1836433
> > 
> > Is that bad? It will be consistent across platforms.
> 
> Hm. I guess the changed behaviour is… better, in the sense that the second 
> scenario (setting the seed before loading the package) is less likely in real 
> analysis code.
> 
> Even so, there are probably some edge cases where this could cause issues, 
> e.g., when:
> 
> set.seed(1)
> MyPackage::SomeFun()
> 
> … where MyPackage causes BiocParallel to be attached, which presumably 
> changes the seed.
> 
> Having thought about it for a while, the fact that bpparam() changes the 
> random seed is only a secondary issue. The main issue is that it doesn’t 
> change the seed reproducibly. So I wouldn’t mind *as much* if repeated calls 
> of:
> 
> set.seed(1)
> bpparam()
> rnorm(1)
> 
> … gave the same result, even if it were different from just running 
> “set.seed(1); rnorm(1)”. (Mind you, I’d still mind a little, but it wouldn’t 
> be so bad.) The biggest problem with the current state of affairs is that the 
> first call gives different results to all subsequent calls, which really 
> interferes with debugging attempts.
> 
> > This behavior
> > 
> >> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
> > [1] 0.9624337 0.8925947
> >> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
> > [1] -0.5703597  0.1102093
> > 
> > seems wrong, but is consistent with mclapply
> > 
> >> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
> > [1] -0.02704527  0.40721777
> >> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
> > [1] -0.8239765  1.2957928
> > 
> > The documented behavior is to us the RNGseed= argument to *Param, but I 
> > think it could be made consistent (by default, obey the global random 
> > number seed on workers) at least on a single machine (where the default 
> > number of cores is constant).
> 
> I’m less concerned with that behaviour, given it’s inherently hard to take 
> randomization code written for serial execution and make it give the same 
> results on multiple cores (as we discussed elsewhere).
> 
> > I have not (yet?) changed the default behavior to SerialParam. I guess the 
> > cost of SerialParam is from the dependent packages that need to be loaded
> > 
> >> system.time(suppressPackageStartupMessages(library(DelayedArray)))
> >   user  system elapsed
> >  3.068   0.082   3.150
> 
> Does calling "SerialParam()” cause DelayedArray to be attached? That seems 
> odd.
> 
> > If fastMNN() makes several calls to bplapply(), it might make sense to 
> > start the default cluster at the top of the function once
> > 
> >    if (!isup(bpparam())) {
> >        bpstart(bpparam())
> >        on.exit(bpstop(bpparam()))
> >    }
> 
> This is probably a good idea to do in general to all of my parallelized 
> functions, though I don’t know how much this will solve the time problem. 
> Perhaps I should just do it and see.
> 
> -A
> 
> > On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" 
> > <[email protected] <mailto:[email protected]> 
> > on behalf of [email protected] 
> > <mailto:[email protected]>> wrote:
> > 
> >    As we know, the default BiocParallel backends are currently set to 
> > MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to 
> > some extent because a new user running, say, bplapply() without additional 
> > arguments or set-up would expect some kind of parallelization. However, 
> > from a developer’s perspective, I would argue that it makes more sense to 
> > use SerialParam() by default. 
> > 
> >    1. It avoids problems with MulticoreParam stalling (especially on Macs) 
> > when the randomly chosen port is in already use. This used to be a major 
> > problem, to the point that all my BiocParallel-using functions in scran 
> > passed BPPARAM=SerialParam() by default. Setting SerialParam() as package 
> > default would ensure BiocParallel functions run properly in the first 
> > place; if the code stalls due to switching to MulticoreParam, then it’s 
> > obvious where the problem lies (and how to fix it).
> > 
> >    2. It avoids the alteration of the random seed when the MulticoreParam 
> > instance is constructed for the first time. 
> > 
> >    library(BiocParallel) # new R session
> >    set.seed(100)
> >    invisible(bplapply(1:5, identity))
> >    rnorm(1) # 0.1315312
> >    set.seed(100)
> >    invisible(bplapply(1:5, identity))
> >    rnorm(1) # -0.5021924
> > 
> >    This is because the first bplapply() call calls bpparam(), which 
> > constructs a MulticoreParam() for the first time; this calls the PRNG to 
> > choose a random port number. Ensuing random numbers are altered, as seen 
> > above. To avoid this, I need to define the MulticoreParam() object prior to 
> > set.seed(), which undermines the utility of a default-defined bpparam().
> > 
> >    3. Job dispatch via SnowParam() is quite slow, which potentially makes 
> > Windows package builds run slower by default. A particularly bad example is 
> > that of scran::fastMNN(), which has a few matrix multiplications that use 
> > DelayedArray:%*%. The %*% is parallelized with the default bpparam(), 
> > resulting in SNOW parallelization on Windows. This slowed down fastMNN()’s 
> > examples from 4 seconds (unix) to ~100 seconds (windows). Clearly, serial 
> > execution is the faster option here. A related problem is 
> > MulticoreParam()’s tendency to copy the environment, which may result in 
> > problems from inflated memory consumption.
> > 
> >    So, can we default to SerialParam() on all platforms? And by this I mean 
> > the BiocParallel in-built default - I don’t want to have to instruct all my 
> > users to put a “register(SerialParam())” at the start of their analysis 
> > scripts. I feel that BiocParallel’s job is to provide downstream code with 
> > the potential for parallelization. If end-users want actual 
> > parallelization, they had better be prepared to specify an appropriate 
> > scheme via *Param() objects. 
> > 
> >    -A
> > 
> > 
> > 
> > 
> >       [[alternative HTML version deleted]]
> > 
> >    _______________________________________________
> >    [email protected] <mailto:[email protected]> mailing list
> >    https://stat.ethz.ch/mailman/listinfo/bioc-devel 
> > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> > 
> 
> _______________________________________________
> [email protected] <mailto:[email protected]> mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel 
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>


        [[alternative HTML version deleted]]

_______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to