Ok. Could you please tell me the latest version that is supported?

Regards,
Guilherme.

On Mon, May 16, 2016 at 12:30 PM, Ralph Castain <r...@open-mpi.org> wrote:

> We used to do so, but don’t currently support that model - folks are
> working on restoring it. No timetable, though I don’t think it will be too
> much longer before it is in master. Can’t say when it will hit release.
>
> On May 16, 2016, at 8:25 AM, Zabiziz Zaz <zabi...@gmail.com> wrote:
>
> Hi Llolsten,
> the problem is not a firewall issue. The simplest way to reproduce the
> problem is to reboot a node in the middle of the job. Is it possible to
> configure Open MPI not to terminate the job if one node is rebooted in
> the middle of the job?
>
> Thanks again for your help.
>
> Regards,
> Guilherme
>
> On Mon, May 16, 2016 at 12:11 PM, Llolsten Kaonga <l...@soft-forge.com>
> wrote:
>
>> Hello Guilherme,
>>
>> This may be off, but try running your mpirun command with the option
>> “--tag-output”. If you see a “broken pipe”, then your issue may be
>> firewall related. You could then check the thread “*Re: [OMPI users]
>> mpirun command won't run unless the firewalld daemon is disabled*” for
>> how to get around this from Gilles or Jeff.
>>
>> I thank you.
>>
>> --
>> Llolsten
>>
>> *From:* users [mailto:users-boun...@open-mpi.org] *On Behalf Of* Zabiziz Zaz
>> *Sent:* Monday, May 16, 2016 10:46 AM
>> *To:* us...@open-mpi.org
>> *Subject:* [OMPI users] ORTE has lost communication
>>
>> Hi,
>>
>> I'm using openmpi-1.10.2 and sometimes I receive the message below:
>>
>> --------------------------------------------------------------------------
>> ORTE has lost communication with its daemon located on node:
>>
>> hostname: xxxx
>>
>> This is usually due to either a failure of the TCP network
>> connection to the node, or possibly an internal failure of
>> the daemon itself. We cannot recover from this failure, and
>> therefore will terminate the job.
>> --------------------------------------------------------------------------
>>
>> My applications are fault tolerant and the jobs usually take weeks to
>> finish. Sometimes a hardware problem occurs with one node, for example,
>> the node shuts down. I don't want MPI to terminate the job: my jobs
>> usually run on hundreds of nodes and I don't care if one node loses
>> communication.
>>
>> Is it possible to change this behavior of Open MPI? I tried setting
>> orte_abort_on_non_zero_status to 0, but it didn't work.
>>
>> Thanks for your help.
>>
>> Regards,
>> Guilherme.
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/05/29214.php