Great. For the record of this thread, I've submitted a patch as PR17391 (https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17391). The patch is made against the latest R-devel in SVN, passes 'make check-all', and I've verified that it works with the above tests.

/Henrik

On Fri, Mar 9, 2018 at 4:37 AM, <luke-tier...@uiowa.edu> wrote:
> I'm happy to look at a patch that does this. I'd start with a small
> interval and increase it by 50%, say, on each try with a max retry time
> limit. This isn't eliminating the problem, only reducing the
> probability, but still worth it. I had considered doing something like
> this but it didn't seem necessary at the time. You don't want to retry
> indefinitely since the connection could be failing because the master
> died, and then you want the workers to die as well.
>
> Best,
>
> luke
>
>
> On Fri, 9 Mar 2018, Henrik Bengtsson wrote:
>
>> A solution is to have parallel:::.slaveRSOCK() attempt to connect
>> multiple times before failing, e.g.
>>
>>     makeSOCKmaster <- function(master, port, timeout, useXDR,
>>                                maxTries = 10L, interval = 1.0) {
>>       port <- as.integer(port)
>>       for (i in seq_len(maxTries)) {
>>         con <- tryCatch({
>>           socketConnection(master, port = port, blocking = TRUE,
>>                            open = "a+b", timeout = timeout)
>>         }, error = identity)
>>         if (inherits(con, "connection")) break
>>         Sys.sleep(interval)
>>       }
>>       if (inherits(con, "error")) stop(con)
>>       structure(list(con = con), class = if (useXDR)
>>         "SOCKnode"
>>       else "SOCK0node")
>>     }
>>
>> One could set 'maxTries' and 'interval' via commandArgs() like what is
>> done for the other arguments.
>>
>> I'm happy to submit an SVN patch if R core thinks this is an
>> acceptable solution.
>>
>> /Henrik
>>
>> On Thu, Mar 8, 2018 at 7:36 PM, Henrik Bengtsson
>> <henrik.bengts...@gmail.com> wrote:
>>>
>>> I just noticed that parallel:::.slaveRSOCK() passes 'timeout' to
>>> socketConnection() as a character, i.e. there's a missing timeout <-
>>> as.integer(timeout), cf.
>>> port <- as.integer(port) and useXDR <- as.logical(value):
>>>
>>> > parallel:::.slaveRSOCK
>>> function ()
>>> {
>>>     makeSOCKmaster <- function(master, port, timeout, useXDR) {
>>>         port <- as.integer(port)
>>>         con <- socketConnection(master, port = port, blocking = TRUE,
>>>             open = "a+b", timeout = timeout)
>>>         structure(list(con = con), class = if (useXDR)
>>>             "SOCKnode"
>>>         else "SOCK0node")
>>>     }
>>>     master <- "localhost"
>>>     port <- NA_integer_
>>>     outfile <- Sys.getenv("R_SNOW_OUTFILE")
>>>     methods <- TRUE
>>>     useXDR <- TRUE
>>>     for (a in commandArgs(TRUE)) {
>>>         pos <- regexpr("=", a)
>>>         name <- substr(a, 1L, pos - 1L)
>>>         value <- substr(a, pos + 1L, nchar(a))
>>>         switch(name, MASTER = {
>>>             master <- value
>>>         }, PORT = {
>>>             port <- value
>>>         }, OUT = {
>>>             outfile <- value
>>>         }, TIMEOUT = {
>>>             timeout <- value
>>>         }, XDR = {
>>>             useXDR <- as.logical(value)
>>>         })
>>>     }
>>>     if (is.na(port))
>>>         stop("PORT must be specified")
>>>     sinkWorkerOutput(outfile)
>>>     msg <- sprintf("starting worker pid=%d on %s at %s\n", Sys.getpid(),
>>>         paste(master, port, sep = ":"), format(Sys.time(), "%H:%M:%OS3"))
>>>     cat(msg)
>>>     slaveLoop(makeSOCKmaster(master, port, timeout, useXDR))
>>> }
>>> <bytecode: 0x4bd4b58>
>>> <environment: namespace:parallel>
>>>
>>> Yet, fixing that does *not* seem to change anything.
>>>
>>> /Henrik
>>>
>>> On Thu, Mar 8, 2018 at 7:05 PM, Henrik Bengtsson
>>> <henrik.bengts...@gmail.com> wrote:
>>>>
>>>> BACKGROUND:
>>>>
>>>> While troubleshooting random, occasionally occurring errors from
>>>> parallel::makePSOCKcluster("localhost", port = 11000):
>>>>
>>>>     Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, :
>>>>       cannot open the connection
>>>>
>>>> I had another look at parallel:::newPSOCKnode(), which is used
>>>> internally to set up each background worker.
>>>> It is designed to first launch the background worker as:
>>>>
>>>>     system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args MASTER=localhost PORT=11000 OUT=/dev/null TIMEOUT=2592000 XDR=TRUE',
>>>>            wait = FALSE)
>>>>
>>>> which immediately tries to connect to a socket on localhost:11000 with
>>>> timeout. Immediately after the master has launched the above (without
>>>> waiting), it will set up the connection, waiting for the connect from
>>>> the background worker:
>>>>
>>>>     con <- socketConnection("localhost", port = 11000, server = TRUE,
>>>>                             blocking = TRUE, open = "a+b", timeout = timeout)
>>>>
>>>>
>>>> ISSUE:
>>>>
>>>> If we emulate the above process, and remove the OUT=/dev/null such
>>>> that we can see the output produced by the worker, as:
>>>>
>>>>     setup <- function(delay = 0) {
>>>>       system('R --slave --no-restore -e "parallel:::.slaveRSOCK()" --args MASTER=localhost PORT=11000 TIMEOUT=2592000 XDR=TRUE',
>>>>              wait = FALSE)
>>>>       Sys.sleep(delay)
>>>>       socketConnection("localhost", port = 11000, server = TRUE,
>>>>                        blocking = TRUE, open = "a+b", timeout = 20)
>>>>     }
>>>>
>>>> doing:
>>>>
>>>>     > con <- setup(0)
>>>>     starting worker pid=24983 on localhost:11000 at 18:44:30.087
>>>>
>>>> will most likely work, but adding a delay:
>>>>
>>>>     > con <- setup(5)
>>>>     starting worker pid=25099 on localhost:11000 at 18:45:23.617
>>>>     Warning in socketConnection(master, port = port, blocking = TRUE, open = "a+b", :
>>>>       localhost:11000 cannot be opened
>>>>     Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b", :
>>>>       cannot open the connection
>>>>     Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster -> socketConnection
>>>>
>>>> will produce an *instant* error on the worker, before the master opens
>>>> the server socket.
>>>> Eventually, the master will produce the originally
>>>> observed error:
>>>>
>>>>     Error in socketConnection("localhost", port = 11000, server = TRUE, blocking = TRUE, :
>>>>       cannot open the connection
>>>>
>>>> In other words, if the master fails to set up socketConnection()
>>>> *before* the background worker attempts to connect, it all fails.
>>>> Such a delay may happen, for instance, when there is a large CPU load on
>>>> the test machine.
>>>>
>>>> Is this the above bug?
>>>>
>>>> /Henrik
>>>>
>>>> PS. The background is that I, very occasionally, observe R CMD check
>>>> errors on the above (on CRAN and elsewhere) when testing my future
>>>> package. The error always goes away when retested. So far I've thought
>>>> this was due to port clashes (since the port is randomly selected in
>>>> [11000:11999]) and accepted that it happens. However, after
>>>> discovering the above, it could be due to the worker launching "too
>>>> soon".
>>>>
>>>> > sessionInfo()
>>>> R version 3.4.3 (2017-11-30)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>> Running under: Ubuntu 16.04.4 LTS
>>>>
>>>> Matrix products: default
>>>> BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
>>>> LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
>>>>
>>>> locale:
>>>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] compiler_3.4.3
>>
>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email:   luke-tier...@uiowa.edu
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu
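
The retry strategy suggested in this thread (start with a small interval, increase it by 50% on each try, and give up after a maximum total retry time) could be sketched in plain R roughly as follows. This is only an illustrative sketch and not the patch submitted as PR17391: the helper name retry_connect(), its arguments (first_interval, growth, max_retry_time) and the 'sleep' hook are all hypothetical and are not part of the 'parallel' package.

```r
# Hypothetical helper, NOT part of 'parallel' and NOT the PR17391 patch.
# Calls connect() until it succeeds, sleeping between failed attempts.
# The sleep interval starts at 'first_interval' seconds and is multiplied
# by 'growth' (1.5 = +50%) after each failure. Retrying stops once the
# accumulated sleep time would exceed 'max_retry_time' seconds.
retry_connect <- function(connect, first_interval = 0.1, growth = 1.5,
                          max_retry_time = 30, sleep = Sys.sleep) {
  interval <- first_interval
  waited <- 0
  repeat {
    res <- tryCatch(connect(), error = identity)
    if (!inherits(res, "error")) return(res)
    # Out of retry budget: re-signal the last error so that the worker
    # dies instead of retrying forever (e.g. when the master is gone).
    if (waited + interval > max_retry_time) stop(res)
    sleep(interval)
    waited <- waited + interval
    interval <- interval * growth
  }
}

# Assumed usage inside a worker, mirroring the socketConnection() call
# quoted in the thread:
# con <- retry_connect(function() {
#   socketConnection(master, port = port, blocking = TRUE,
#                    open = "a+b", timeout = timeout)
# })
```

Bounding the total retry time rather than the number of attempts follows the concern raised in the thread that workers should not retry indefinitely if the master has died. The 'sleep' argument exists only so that the waiting can be replaced in tests.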