Hi Llolsten,

The problem is not a firewall issue. The simplest way to reproduce it is to reboot a node in the middle of a job.

Is it possible to configure Open MPI so that it does not terminate the job when one node is rebooted mid-run?

Thanks again for your help.

Regards,
Guilherme

On Mon, May 16, 2016 at 12:11 PM, Llolsten Kaonga <l...@soft-forge.com> wrote:

> Hello Guilherme,
>
> This may be off but try running your mpirun command with the option
> “--tag-output”. If you see a “broken pipe”, then your issue may be
> firewall related. You could then check the thread “*Re: [OMPI users]
> mpirun command won't run unless the firewalld daemon is disabled*” for
> how to get around this from Gilles or Jeff.
>
> I thank you.
>
> --
> Llolsten
>
> *From:* users [mailto:users-boun...@open-mpi.org] *On Behalf Of* Zabiziz Zaz
> *Sent:* Monday, May 16, 2016 10:46 AM
> *To:* us...@open-mpi.org
> *Subject:* [OMPI users] ORTE has lost communication
>
> Hi,
>
> I'm using openmpi-1.10.2 and sometimes I'm receiving the message below:
>
> --------------------------------------------------------------------------
> ORTE has lost communication with its daemon located on node:
>
>   hostname: xxxx
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
>
> My applications are fault tolerant and the jobs usually takes weeks to
> finish. Sometimes a hardware problem occurs with one node, for example, the
> node shutdown. I don't want mpi to terminate the job, my jobs usually have
> hundreds of nodes and I don't care if 1 node lost communication.
>
> It's possible to change this behavior of openmpi? I tried to
> set orte_abort_on_non_zero_status to 0 but it didn't work.
>
> Thanks for your help.
>
> Regards,
> Guilherme.