Hi,

I'm using openmpi-1.10.2 and sometimes receive the message below:

--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  xxxx

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

My applications are fault tolerant, and the jobs usually take weeks to finish. Sometimes a hardware problem occurs on one node; for example, the node shuts down. I don't want MPI to terminate the job in that case: my jobs usually run on hundreds of nodes, and I don't care if one node loses communication.

Is it possible to change this behavior in Open MPI? I tried setting orte_abort_on_non_zero_status to 0, but it didn't work.

Thanks for your help.

Regards,
Guilherme