Hello Guilherme,

 

This may be off the mark, but try running your mpirun command with the option 
"--tag-output". If you see a "broken pipe" message in the output, your issue 
may be firewall related. In that case, check the thread "Re: [OMPI users] 
mpirun command won't run unless the firewalld daemon is disabled" for how to 
get around this, from Gilles or Jeff.
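
If it helps, here is a minimal sketch of that check in Python. It is only an 
illustration: the helper names (find_broken_pipe, run_tagged) are made up for 
this example and are not part of Open MPI, and it assumes mpirun is on your 
PATH. It simply looks for the text "broken pipe" anywhere in the tagged 
output, so it does not depend on the exact format of the tags.

```python
import subprocess


def find_broken_pipe(lines):
    """Return the output lines that mention a broken pipe (case-insensitive)."""
    return [line.rstrip("\n") for line in lines if "broken pipe" in line.lower()]


def run_tagged(app_argv, mpirun="mpirun"):
    """Run `mpirun --tag-output <app_argv>` and return its output as lines.

    stderr is merged into stdout so that error messages are scanned too.
    `app_argv` is whatever you normally pass to mpirun (hypothetical here).
    """
    cmd = [mpirun, "--tag-output", *app_argv]
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,  # decode output as text; needs Python 3.7+
    )
    return proc.stdout.splitlines()
```

For example, find_broken_pipe(run_tagged(["-np", "4", "./my_app"])) would 
return the suspicious lines, where "./my_app" and the process count stand in 
for your own command. An empty list does not rule out a firewall problem; it 
only means that this particular message did not appear.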

 

I thank you.

--

Llolsten

 

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Zabiziz Zaz
Sent: Monday, May 16, 2016 10:46 AM
To: us...@open-mpi.org
Subject: [OMPI users] ORTE has lost communication

 

Hi,

I'm using openmpi-1.10.2 and sometimes receive the message below:

--------------------------------------------------------------------------
ORTE has lost communication with its daemon located on node:

  hostname:  xxxx

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------

 

My applications are fault tolerant, and the jobs usually take weeks to finish. 
Sometimes a hardware problem occurs on one node, for example, the node shuts 
down. I don't want MPI to terminate the job: my jobs usually use hundreds of 
nodes, and I don't care if one node loses communication.

 

Is it possible to change this behavior of Open MPI? I tried setting 
orte_abort_on_non_zero_status to 0, but it didn't work.

 

Thanks for your help.

 

Regards,

Guilherme.
