A solution is to have parallel:::.slaveRSOCK() attempt to connect multiple times before failing, e.g.
makeSOCKmaster <- function(master, port, timeout, useXDR, maxTries = 10L, interval = 1.0) { port <- as.integer(port) for (i in seq_len(maxTries)) { con <- tryCatch({ socketConnection(master, port = port, blocking = TRUE, open = "a+b", timeout = timeout) }, error = identity) if (inherits(con, "connection")) break Sys.sleep(interval) } if (inherits(con, "error")) stop(con) structure(list(con = con), class = if (useXDR) "SOCKnode" else "SOCK0node") } One could set 'maxTries' and 'interval' via commandArgs() like what is done for the other arguments. I'm happy to submit an SVN patch if R core thinks this is an acceptable solution. /Henrik On Thu, Mar 8, 2018 at 7:36 PM, Henrik Bengtsson <henrik.bengts...@gmail.com> wrote: > I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to > socketConnection() as a character, i.e. there's a missing timeout <- > as.integer(timeout), cf. port <- as.integer(port) and useXDR <- > as.logical(value): > >> parallel:::.slaveRSOCK > function () > { > makeSOCKmaster <- function(master, port, timeout, useXDR) { > port <- as.integer(port) > con <- socketConnection(master, port = port, blocking = TRUE, > open = "a+b", timeout = timeout) > structure(list(con = con), class = if (useXDR) > "SOCKnode" > else "SOCK0node") > } > master <- "localhost" > port <- NA_integer_ > outfile <- Sys.getenv("R_SNOW_OUTFILE") > methods <- TRUE > useXDR <- TRUE > for (a in commandArgs(TRUE)) { > pos <- regexpr("=", a) > name <- substr(a, 1L, pos - 1L) > value <- substr(a, pos + 1L, nchar(a)) > switch(name, MASTER = { > master <- value > }, PORT = { > port <- value > }, OUT = { > outfile <- value > }, TIMEOUT = { > timeout <- value > }, XDR = { > useXDR <- as.logical(value) > }) > } > if (is.na(port)) > stop("PORT must be specified") > sinkWorkerOutput(outfile) > msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(), > paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3")) > cat(msg) > slaveLoop(makeSOCKmaster(master, port, timeout, useXDR)) > } > <bytecode: 0x4bd4b58> > <environment: namespace:parallel> > > Yet, fix that does *not* seem to change anything. > > /Henrik > > On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson > <henrik.bengts...@gmail.com> wrote: >> BACKGROUND: >> >> While troubleshooting random, occasionally occurring, errors from >> parallel::makePSOCKcluster("localhost", port = 11000); >> >> Error in socketConnection("localhost", port = port, server = TRUE, >> blocking = TRUE, : >> cannot open the connection >> >> I had another look at parallel:::newPSOCKnode(), which is used >> internally to set up each background worker. It is designed to, first >> launch the background worker as: >> >> system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args >> MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE', >> wait = FALSE) >> >> which immediately tries to connect to a socket on localhost:11000 with >> timeout. Immediately after the master launched the above (without >> waiting), it will set up the connection waiting for the connect from >> the background worker: >> >> con <- socketConnection("localhost", port = 11000, server = TRUE, >> blocking = TRUE, open = "a+b", timeout = timeout) >> >> >> ISSUE: >> >> If we emulate the above process, and remove the OUT=/dev/null such >> that we can see the output produces by the worker, as: >> >> setup <- function(delay = 0) { >> system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args >> MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE', wait = FALSE) >> Sys.sleep(delay) >> socketConnection("localhost", port = 11000, server = TRUE, blocking >> = TRUE, open = "a+b", timeout = 20) >> } >> >> doing: >> >>> con <- setup(0) >> starting worker pid=24983 on localhost:11000 at 18:44:30.087 >> >> will most likely work, but adding a delay: >> >>> con <- setup(5) >> starting worker pid=25099 on localhost:11000 at 18:45:23.617 >> Warning in socketConnection(master, port = port, blocking = TRUE, open >> = "a+b", : >> localhost:11000 cannot be opened >> Error in socketConnection(master, port = port, blocking = TRUE, open = >> "a+b", : >> cannot open the connection >> Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster -> >> socketConnection >> >> will produce an *instant* error on the worker, and before master opens >> the server socket. Eventually, master will produce the originally >> observed error: >> >> Error in socketConnection("localhost", port = 11000, server = TRUE, >> blocking = TRUE, : >> cannot open the connection >> >> In other words, if the master fails to setup socketConnection() >> *before* the background workers attempts to connect, it all fails. >> Such a delay may happen for instance when there is a large CPU load on >> the test machine. >> >> Is this the above bug? >> >> /Henrik >> >> PS. The background is that I, very occasionally, observe R CMD check >> error on the above (on CRAN and elsewhere) when testing my future >> package. The error always go away when retested. This far I've though >> this is due to port clashes (since the port is random selected in >> [11000:11999]) and accepted that it happens. However, after >> discovering the above, it could be due to the worker launching "too >> soon". >> >>> sessionInfo() >> R version 3.4.3 (2017-11-30) >> Platform: x86_64-pc-linux-gnu (64-bit) >> Running under: Ubuntu 16.04.4 LTS >> >> Matrix products: default >> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0 >> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0 >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> loaded via a namespace (and not attached): >> [1] compiler_3.4.3 ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel