We used to do so, but we don't currently support that model; folks are working on restoring it. There's no timetable, though I don't think it will be much longer before it is in master. I can't say when it will hit a release.
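For reference, a hedged sketch of the recovery-related knobs in the 1.10-era ORTE runtime. These MCA parameters exist, but daemon-loss recovery itself was not supported at the time, so they may not prevent the abort described below; the application name, process count, and hostfile are placeholders.

```shell
# Ask ORTE to attempt recovery from process failures instead of aborting
# the whole job, and not to abort on a non-zero exit status.
# NOTE: with openmpi-1.10.x, loss of an orted daemon (e.g. a node reboot)
# still terminates the job regardless of these settings.
mpirun --mca orte_enable_recovery 1 \
       --mca orte_abort_on_non_zero_status 0 \
       -np 256 --hostfile hosts.txt ./my_app   # my_app, hosts.txt: placeholders
```

You can confirm which parameters your build actually recognizes with `ompi_info --param orte all`.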
> On May 16, 2016, at 8:25 AM, Zabiziz Zaz <zabi...@gmail.com> wrote:
>
> Hi Llolsten,
> the problem is not a firewall issue. The simplest way to reproduce the
> problem is to reboot a node in the middle of the job. Is it possible to
> configure Open MPI so that it does not terminate the job if one node is
> rebooted mid-job?
>
> Thanks again for your help.
>
> Regards,
> Guilherme
>
> On Mon, May 16, 2016 at 12:11 PM, Llolsten Kaonga <l...@soft-forge.com> wrote:
>
> Hello Guilherme,
>
> This may be off, but try running your mpirun command with the option
> "--tag-output". If you see a "broken pipe", then your issue may be
> firewall related. You could then check the thread "Re: [OMPI users]
> mpirun command won't run unless the firewalld daemon is disabled" for
> how to get around this, from Gilles or Jeff.
>
> I thank you.
> --
> Llolsten
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Zabiziz Zaz
> Sent: Monday, May 16, 2016 10:46 AM
> To: us...@open-mpi.org
> Subject: [OMPI users] ORTE has lost communication
>
> Hi,
> I'm using openmpi-1.10.2 and sometimes I'm receiving the message below:
>
> --------------------------------------------------------------------------
> ORTE has lost communication with its daemon located on node:
>
>   hostname: xxxx
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
>
> My applications are fault tolerant and the jobs usually take weeks to
> finish. Sometimes a hardware problem occurs with one node; for example,
> the node shuts down.
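The `--tag-output` suggestion above can be sketched as follows. It prefixes every line of stdout/stderr with the job and rank that produced it, which helps attribute a "broken pipe" to a specific node; the application name is a placeholder.

```shell
# Tag each output line with [jobid,rank]<stream> so failures can be
# traced to a particular rank/node. ./my_app is a placeholder.
mpirun --tag-output -np 4 ./my_app
```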
> I don't want MPI to terminate the job; my jobs usually span hundreds of
> nodes and I don't care if one node loses communication.
>
> Is it possible to change this behavior in Open MPI? I tried setting
> orte_abort_on_non_zero_status to 0, but it didn't work.
>
> Thanks for your help.
>
> Regards,
> Guilherme.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29214.php

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29218.php