Summary: at the end of this message is a link to an R package implementing an interface for managing the use of execution units in R packages. As a package maintainer, would you agree to use something like this? Does it look sufficiently reasonable to become a part of R? Read on for why I made these particular interface choices.
My understanding of the problem stated by Simon Urbanek and Uwe Ligges [1,2] is that we need a way to set and distribute the CPU core allowance between multiple packages that could be using very different methods to achieve parallel execution on the local machine, including threads and child processes. We could have multiple well-meaning packages calling each other, each of them using a different parallelism technology: imagine parallel::makeCluster(getOption('mc.cores')) combined with parallel::mclapply(mc.cores = getOption('mc.cores')) and with an OpenMP program that also spawns getOption('mc.cores') threads. A parallel BLAS or custom multi-threading using std::thread could add more fuel to the fire.

The workarounds applied by package maintainers nowadays are both cumbersome (sometimes one has to talk to a package that lives downstream in the call stack and isn't even an explicit dependency, because it's the one responsible for the threads) and not really enough (most maintainers forget to restore the state after they are done, so a single example() may slow down the operations that follow).

The problem is complicated by the fact that not every parallel operation can explicitly accept the CPU core limit as a parameter. For example, data.table's implicit parallelism is very convenient, and so are parallel BLASes (which don't have a standard interface to change the number of threads), so we shouldn't be prohibiting implicit parallelism. It's also not always obvious how to split the cores between the potentially parallel sections. While it's typically best to start with the outer loop (e.g. it's better to have 16 R processes solving relatively small linear algebra problems back to back than one R process spinning 15 of its 16 OpenBLAS threads in sched_yield()), it may be more efficient to give all 16 threads back to BLAS (and save on transferring the problems and solutions between processes) once the problems become large enough to give enough work to all of the cores.

So as a user, I would like an interface that would both let me give all of the cores to the program if that's what I need (something like setCPUallowance(parallelly::availableCores())) _and_ let me be more detailed when necessary (something like setCPUallowance(overall = 7, packages = c(foobar = 1), BLAS = 2) to limit BLAS threads to 2, disallow parallelism in the foobar package because it wastes too much time, and limit R as a whole to 7 cores because I want to surf the 'net on the remaining one while the Monte Carlo simulation is going on). As a package developer, I'd rather not think about any of that and just use a function call like getCPUallowance() for the default number of cores in every situation.

Can we implement such an interface? The main obstacle is not being able to know when each parallel region begins and ends. Does the package call fork()? std::thread{}? Start a local mirai cluster? We have to trust the package (and verify during R CMD check) to create the given number of units of execution and to tell us when they are done. The closest interface that I see being implementable is a system of tokens with reference semantics: getCPUallowance() returns a special object containing the number of tokens the caller is allowed to use and sets an environment variable with the remaining number of cores. Any R child processes pick up the number of cores from the environment variable. Any downstream calls to getCPUallowance(), aware of the tokens already handed out, return a reduced number of remaining CPU cores.
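To make that concrete, here is a minimal sketch of how the two sides could look. The setCPUallowance()/getCPUallowance()/close() names are the ones proposed above; the token$cores field and the simulate_one() helper are made up for illustration, and the actual API of the package linked below may well differ:

    ## User side (e.g. in .Rprofile or at the top of a script):
    ## declare the overall allowance once.
    setCPUallowance(overall = 7, packages = c(foobar = 1), BLAS = 2)

    ## Package side: reserve execution units, use them, give them back.
    run_simulation <- function(tasks) {
        token <- getCPUallowance()          # how many cores may we use here?
        on.exit(close(token), add = TRUE)   # return the allowance even on error
        parallel::mclapply(tasks, simulate_one, mc.cores = token$cores)
    }

    ## While the token is held, nested getCPUallowance() calls (and child R
    ## processes reading the environment variable) see a reduced allowance.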
Once the package is done executing a parallel section, it returns the CPU allowance back to R by calling something like close(token), which updates the internal allowance value (and the environment variable). (A finalizer can also be set on the tokens to ensure that CPU cores won't be lost.)

Here's a package implementing this idea: <https://codeberg.org/aitap/R-CPUallowance>. Currently missing are the terrible hacks needed to determine the BLAS type at runtime and resolve the necessary symbols to set the number of BLAS threads, depending on whether it's OpenBLAS, FlexiBLAS, MKL, or something else.

Does it feel over-engineered? I hope that, even if this isn't a good solution in itself, it would let us move towards a unified one that could just work™ on everything ranging from laptops to CRAN testing machines to HPCs.

-- 
Best regards,
Ivan

[1] https://stat.ethz.ch/pipermail/r-package-devel/2023q3/009484.html
[2] https://stat.ethz.ch/pipermail/r-package-devel/2023q3/009513.html