Hello Guilherme,
This may be off, but try running your mpirun command with the option “--tag-output”. If you see a “broken pipe”, then your issue may be firewall related. You could then check the thread “Re: [OMPI users] mpirun command won't run unless the firewalld daemon is disabled” for how to get around this from Gilles or Jeff.

Thank you.
--
Llolsten

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Zabiziz Zaz
Sent: Monday, May 16, 2016 10:46 AM
To: us...@open-mpi.org
Subject: [OMPI users] ORTE has lost communication

Hi,
I'm using openmpi-1.10.2 and sometimes I receive the message below:

--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname: xxxx

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

My applications are fault tolerant and the jobs usually take weeks to finish. Sometimes a hardware problem occurs with one node; for example, the node shuts down. I don't want MPI to terminate the job: my jobs usually have hundreds of nodes and I don't care if 1 node loses communication. Is it possible to change this behavior of openmpi? I tried to set orte_abort_on_non_zero_status to 0 but it didn't work.

Thanks for your help.
Regards,
Guilherme.