I have a process that I need to parallelize, and have a question about two different ways to proceed. It is essentially an MCMC exploration where the likelihood is a sum over subjects (6000 of them), and the per-subject computation is the slow part.
Here is a rough schematic of the code using one approach:

mymc <- function(formula, data, subset, na.action, id, etc) {
    # lots of setup, long but computationally quick
    hlog <- function(thisid, param) {
        # compute the loglik for this subject
        ...
    }
    uid <- unique(id)    # multiple data rows for each subject
    for (i in 1:burnin) {
        param <- get_next_proposal()
        loglist <- mclapply(uid, hlog, param = param)
        loglik <- sum(unlist(loglist))
        # process result
    }
    # Now the non-burnin MCMC iterations
    ...
}

The second approach is to put cluster formation outside the loop, e.g.,

    ...
    clust <- makeForkCluster()
    for (i in 1:burnin) {
        param <- get_next_proposal()
        loglist <- parLapply(clust, uid, hlog, param = param)
        loglik <- sum(unlist(loglist))
        # process result
    }
    # rest of the code
    stopCluster(clust)

------------------

On the face of it, the second looks like it "could" be more efficient, since it starts and stops the subprocesses only once. A short trial on one of our cluster servers seems to say the opposite. On a quiet machine the load average never gets much over 5-6 using method 2, but is in the 20s for method 1 (detectCores() = 80 on the box; we used mc.cores = 50). Wall time for method 2 is looking to be several hours.

Any pointers to documentation or discussion at this level would be much appreciated; I'm going to be fitting a lot of models.

Terry T.
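In the spirit of the posting guide, here is a minimal self-contained sketch of the two patterns. The toy hlog() and the subject ids are placeholders for the real per-subject computation, and the worker counts are kept small purely for illustration (a PSOCK cluster is used so the sketch also runs where fork is unavailable; on Unix makeForkCluster() would be the analogue of the real code):

```r
library(parallel)

uid  <- 1:100                                   # stand-in subject ids
hlog <- function(thisid, param) {               # toy per-subject loglik
    -0.5 * (thisid / 100 - param)^2
}
param <- 0.5

## Method 1: fork fresh workers at each call (mclapply)
## mc.cores forced to 1 on Windows, where forking is unsupported
ncore   <- if (.Platform$OS.type == "unix") 2L else 1L
loglik1 <- sum(unlist(mclapply(uid, hlog, param = param, mc.cores = ncore)))

## Method 2: one persistent cluster reused across iterations
clust   <- makePSOCKcluster(2)                  # makeForkCluster() on Unix
loglik2 <- sum(unlist(parLapply(clust, uid, hlog, param = param)))
stopCluster(clust)

all.equal(loglik1, loglik2)                     # both give the same sum
```

Both methods compute the identical sum; the question is only about scheduling and process overhead.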
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.