We used to do so, but we don't currently support that model; folks are working on restoring it. There's no timetable, though I don't think it will be much longer before it is in master. I can't say when it will hit a release.
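For reference, a hedged sketch of the recovery-related knobs in the 1.10-era ORTE runtime. These MCA parameters exist, but daemon-loss recovery itself was not supported at the time, so they may not prevent the abort described below; the application name, process count, and hostfile are placeholders.

```shell
# Ask ORTE to attempt recovery from process failures instead of aborting
# the whole job, and not to abort on a non-zero exit status.
# NOTE: with openmpi-1.10.x, loss of an orted daemon (e.g. a node reboot)
# still terminates the job regardless of these settings.
mpirun --mca orte_enable_recovery 1 \
       --mca orte_abort_on_non_zero_status 0 \
       -np 256 --hostfile hosts.txt ./my_app   # my_app, hosts.txt: placeholders
```

You can confirm which parameters your build actually recognizes with `ompi_info --param orte all`.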
> On May 16, 2016, at 8:25 AM, Zabiziz Zaz <zabi...@gmail.com> wrote:
>
> Hi Llolsten,
> the problem is not a firewall issue. The simplest way to reproduce the
> problem is to reboot a node in the middle of the job. Is it possible to
> configure Open MPI so that it does not terminate the job if one node is
> rebooted mid-job?
>
> Thanks again for your help.
>
> Regards,
> Guilherme
>
> On Mon, May 16, 2016 at 12:11 PM, Llolsten Kaonga <l...@soft-forge.com> wrote:
>
> Hello Guilherme,
>
> This may be off, but try running your mpirun command with the option
> "--tag-output". If you see a "broken pipe", then your issue may be
> firewall related. You could then check the thread "Re: [OMPI users]
> mpirun command won't run unless the firewalld daemon is disabled" for
> how to get around this, from Gilles or Jeff.
>
> I thank you.
> --
> Llolsten
>
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Zabiziz Zaz
> Sent: Monday, May 16, 2016 10:46 AM
> To: us...@open-mpi.org
> Subject: [OMPI users] ORTE has lost communication
>
> Hi,
> I'm using openmpi-1.10.2 and sometimes I'm receiving the message below:
>
> --------------------------------------------------------------------------
> ORTE has lost communication with its daemon located on node:
>
>   hostname: xxxx
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
> --------------------------------------------------------------------------
>
> My applications are fault tolerant and the jobs usually take weeks to
> finish. Sometimes a hardware problem occurs with one node; for example,
> the node shuts down.
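The `--tag-output` suggestion above can be sketched as follows. It prefixes every line of stdout/stderr with the job and rank that produced it, which helps attribute a "broken pipe" to a specific node; the application name is a placeholder.

```shell
# Tag each output line with [jobid,rank]<stream> so failures can be
# traced to a particular rank/node. ./my_app is a placeholder.
mpirun --tag-output -np 4 ./my_app
```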
> I don't want MPI to terminate the job; my jobs usually span hundreds of
> nodes and I don't care if one node loses communication.
>
> Is it possible to change this behavior in Open MPI? I tried setting
> orte_abort_on_non_zero_status to 0, but it didn't work.
>
> Thanks for your help.
>
> Regards,
> Guilherme.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29214.php

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2016/05/29218.php