> I hope for 1. to have a 'local socket' (i.e., not using ports) implementation 
> shortly.

Yes, that would be helpful.

> I committed a patch in 1.17.6 for the wrong-seeming behavior of 2. We now have
> 
>> library(BiocParallel)
>> set.seed(1); p = bpparam(); rnorm(1)
> [1] -0.6264538
>> set.seed(1); p = bpparam(); rnorm(1)
> [1] -0.6264538
> 
> at the expensive of using the generator when the package is loaded.
> 
>> set.seed(1); rnorm(1)
> [1] -0.6264538
>> set.seed(1); library(BiocParallel); rnorm(1)
> [1] 0.1836433
> 
> Is that bad? It will be consistent across platforms.

Hm. I guess the changed behaviour is… better, in the sense that the second 
scenario (setting the seed before loading the package) is less likely in real 
analysis code.

Even so, there are probably some edge cases where this could cause issues, 
e.g., when:

set.seed(1)
MyPackage::SomeFun()

… where MyPackage causes BiocParallel to be attached, which presumably changes 
the seed.

Having thought about it for a while, the fact that bpparam() changes the random 
seed is only a secondary issue. The main issue is that it doesn’t change the 
seed reproducibly. So I wouldn’t mind *as much* if repeated calls of:

set.seed(1)
bpparam()
rnorm(1)

… gave the same result, even if it were different from just running 
“set.seed(1); rnorm(1)”. (Mind you, I’d still mind a little, but it wouldn’t be 
so bad.) The biggest problem with the current state of affairs is that the 
first call gives different results to all subsequent calls, which really 
interferes with debugging attempts.

> This behavior
> 
>> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
> [1] 0.9624337 0.8925947
>> set.seed(1); unlist(bplapply(1:2, function(i) rnorm(1)))
> [1] -0.5703597  0.1102093
> 
> seems wrong, but is consistent with mclapply
> 
>> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
> [1] -0.02704527  0.40721777
>> set.seed(1); unlist(mclapply(1:2, function(i) rnorm(1)))
> [1] -0.8239765  1.2957928
> 
> The documented behavior is to us the RNGseed= argument to *Param, but I think 
> it could be made consistent (by default, obey the global random number seed 
> on workers) at least on a single machine (where the default number of cores 
> is constant).

I’m less concerned with that behaviour, given it’s inherently hard to take 
randomization code written for serial execution and make it give the same 
results on multiple cores (as we discussed elsewhere).

> I have not (yet?) changed the default behavior to SerialParam. I guess the 
> cost of SerialParam is from the dependent packages that need to be loaded
> 
>> system.time(suppressPackageStartupMessages(library(DelayedArray)))
>   user  system elapsed
>  3.068   0.082   3.150

Does calling "SerialParam()” cause DelayedArray to be attached? That seems odd.

> If fastMNN() makes several calls to bplapply(), it might make sense to start 
> the default cluster at the top of the function once
> 
>    if (!isup(bpparam())) {
>        bpstart(bpparam())
>        on.exit(bpstop(bpparam()))
>    }

This is probably a good idea to do in general to all of my parallelized 
functions, though I don’t know how much this will solve the time problem. 
Perhaps I should just do it and see.

-A

> On 1/6/19, 11:16 PM, "Bioc-devel on behalf of Aaron Lun" 
> <[email protected] on behalf of 
> [email protected]> wrote:
> 
>    As we know, the default BiocParallel backends are currently set to 
> MulticoreParam (Linux/Mac) or SnowParam (Windows). I can understand this to 
> some extent because a new user running, say, bplapply() without additional 
> arguments or set-up would expect some kind of parallelization. However, from 
> a developer’s perspective, I would argue that it makes more sense to use 
> SerialParam() by default. 
> 
>    1. It avoids problems with MulticoreParam stalling (especially on Macs) 
> when the randomly chosen port is in already use. This used to be a major 
> problem, to the point that all my BiocParallel-using functions in scran 
> passed BPPARAM=SerialParam() by default. Setting SerialParam() as package 
> default would ensure BiocParallel functions run properly in the first place; 
> if the code stalls due to switching to MulticoreParam, then it’s obvious 
> where the problem lies (and how to fix it).
> 
>    2. It avoids the alteration of the random seed when the MulticoreParam 
> instance is constructed for the first time. 
> 
>    library(BiocParallel) # new R session
>    set.seed(100)
>    invisible(bplapply(1:5, identity))
>    rnorm(1) # 0.1315312
>    set.seed(100)
>    invisible(bplapply(1:5, identity))
>    rnorm(1) # -0.5021924
> 
>    This is because the first bplapply() call calls bpparam(), which 
> constructs a MulticoreParam() for the first time; this calls the PRNG to 
> choose a random port number. Ensuing random numbers are altered, as seen 
> above. To avoid this, I need to define the MulticoreParam() object prior to 
> set.seed(), which undermines the utility of a default-defined bpparam().
> 
>    3. Job dispatch via SnowParam() is quite slow, which potentially makes 
> Windows package builds run slower by default. A particularly bad example is 
> that of scran::fastMNN(), which has a few matrix multiplications that use 
> DelayedArray:%*%. The %*% is parallelized with the default bpparam(), 
> resulting in SNOW parallelization on Windows. This slowed down fastMNN()’s 
> examples from 4 seconds (unix) to ~100 seconds (windows). Clearly, serial 
> execution is the faster option here. A related problem is MulticoreParam()’s 
> tendency to copy the environment, which may result in problems from inflated 
> memory consumption.
> 
>    So, can we default to SerialParam() on all platforms? And by this I mean 
> the BiocParallel in-built default - I don’t want to have to instruct all my 
> users to put a “register(SerialParam())” at the start of their analysis 
> scripts. I feel that BiocParallel’s job is to provide downstream code with 
> the potential for parallelization. If end-users want actual parallelization, 
> they had better be prepared to specify an appropriate scheme via *Param() 
> objects. 
> 
>    -A
> 
> 
> 
> 
>       [[alternative HTML version deleted]]
> 
>    _______________________________________________
>    [email protected] mailing list
>    https://stat.ethz.ch/mailman/listinfo/bioc-devel
> 

_______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to