As we know, the default BiocParallel backend is currently MulticoreParam
(Linux/Mac) or SnowParam (Windows). I can understand this to some extent,
because a new user running, say, bplapply() without additional arguments or
set-up would expect some kind of parallelization. However, from a developer’s
perspective, I would argue that it makes more sense to use SerialParam() by
default, for three reasons:

1. It avoids problems with MulticoreParam stalling (especially on Macs) when
the randomly chosen port is already in use. This used to be a major problem, to
the point that all my BiocParallel-using functions in scran passed
BPPARAM=SerialParam() by default. Setting SerialParam() as the package default
would ensure that BiocParallel functions run properly in the first place; if
the code then stalls after switching to MulticoreParam, it’s obvious where the
problem lies (and how to fix it).

2. It avoids the alteration of the random seed when the MulticoreParam instance
is constructed for the first time.

    library(BiocParallel) # new R session
    set.seed(100)
    invisible(bplapply(1:5, identity)) # first call constructs the default param
    rnorm(1) # 0.1315312

    set.seed(100)
    invisible(bplapply(1:5, identity)) # default param already exists
    rnorm(1) # -0.5021924

This happens because the first bplapply() call invokes bpparam(), which
constructs a MulticoreParam() for the first time; this calls the PRNG to choose
a random port number. Subsequent random numbers are altered, as seen above. To
avoid this, I need to construct the MulticoreParam() object before calling
set.seed(), which undermines the utility of a default-defined bpparam().

3. Job dispatch via SnowParam() is quite slow, which potentially makes Windows
package builds run slower by default. A particularly bad example is
scran::fastMNN(), which has a few matrix multiplications that use
DelayedArray’s %*%. The %*% is parallelized with the default bpparam(),
resulting in SNOW parallelization on Windows. This slowed down fastMNN()’s
examples from 4 seconds (Unix) to ~100 seconds (Windows). Clearly, serial
execution is the faster option here. A related problem is MulticoreParam()’s
tendency to copy the environment, which may inflate memory consumption.
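
To make points 1 and 2 concrete, here is a minimal sketch of the
serial-by-default pattern. squareAll() is a hypothetical function made up for
illustration (it is not part of scran or BiocParallel); the only assumption is
that the function exposes a BPPARAM argument and forwards it to bplapply().

```r
library(BiocParallel)

# Hypothetical downstream function: runs serially unless the caller
# supplies a different *Param object.
squareAll <- function(x, BPPARAM = SerialParam()) {
    unlist(bplapply(x, function(i) i^2, BPPARAM = BPPARAM))
}

# Serial by default, so no workers are started and no port is chosen.
squareAll(1:5)

# Same check as in point 2, but with an explicit SerialParam().
# If the serial backend leaves the session's PRNG alone, the two
# draws should match.
set.seed(100)
invisible(bplapply(1:5, identity, BPPARAM = SerialParam()))
a <- rnorm(1)

set.seed(100)
invisible(bplapply(1:5, identity, BPPARAM = SerialParam()))
b <- rnorm(1)

identical(a, b)
```

A caller who wants parallelization can still pass, e.g.,
BPPARAM=MulticoreParam(2) to squareAll() without any change to the function.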

So, can we default to SerialParam() on all platforms? By this I mean
BiocParallel’s built-in default; I don’t want to have to instruct all my users
to put a “register(SerialParam())” at the start of their analysis scripts. I
feel that BiocParallel’s job is to provide downstream code with the potential
for parallelization. If end-users want actual parallelization, they should be
prepared to specify an appropriate scheme via *Param() objects.
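
For reference, a rough sketch of what this looks like from the end-user side:
the register() workaround that scripts need today, followed by an explicit
opt-in to parallelization for a single call. The worker count of 2 and the
platform switch are arbitrary choices for illustration, not a recommendation.

```r
library(BiocParallel)

# The workaround users currently need to make serial execution the
# session-wide default.
register(SerialParam())
class(bpparam())  # the param that bplapply() uses when BPPARAM is missing

# Explicit opt-in to parallelization for one call.
# (2 workers is an arbitrary number chosen for this sketch.)
param <- if (.Platform$OS.type == "windows") {
    SnowParam(workers = 2)
} else {
    MulticoreParam(workers = 2)
}
res <- bplapply(1:4, sqrt, BPPARAM = param)
```

With SerialParam() as the built-in default, only the second half would be
needed, and only by users who actually want parallel execution.
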
-A
_______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel