Hi,
I'm using openmpi-1.10.2 and sometimes receive the message below:
--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  xxxx

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.

--------------------------------------------------------------------------

My applications are fault tolerant, and the jobs usually take weeks to
finish. Sometimes a hardware problem occurs on one node, for example, the
node shuts down. I don't want MPI to terminate the job in that case: my jobs
usually run on hundreds of nodes, and I don't care if one node loses
communication.

Is it possible to change this behavior in Open MPI? I tried
setting orte_abort_on_non_zero_status to 0, but it didn't work.

Thanks for your help.

Regards,
Guilherme.
