On 01/10/2015 10:24, Emyr James wrote:

"ORTE has lost communication with its daemon located on node:

  hostname:  node123

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job."

Could this issue be related to the following FAQ entry?

https://www.open-mpi.org/faq/?category=troubleshooting#large-job-tcp-oob-timeout

I'm in discussion with our system managers about increasing the relevant kernel parameters, as we are currently running with the default settings.

Cheers,

Emyr