I honestly have no idea…

> On May 16, 2016, at 10:39 AM, Zabiziz Zaz <zabi...@gmail.com> wrote:
> 
> Ok.
> Could you please tell me the latest version that is supported?
> 
> Regards,
> Guilherme.
> 
> On Mon, May 16, 2016 at 12:30 PM, Ralph Castain <r...@open-mpi.org> wrote:
> We used to do so, but don’t currently support that model - folks are working 
> on restoring it. There is no timetable, though I don’t think it will be too 
> much longer before it is in master. I can’t say when it will reach a release.
> 
>> On May 16, 2016, at 8:25 AM, Zabiziz Zaz <zabi...@gmail.com> wrote:
>> 
>> Hi Llolsten,
>> the problem is not a firewall issue. The simplest way to reproduce the 
>> problem is to reboot a node in the middle of the job. Is it possible to 
>> configure Open MPI not to terminate the job if one node is rebooted while 
>> the job is running?
>> 
>> Thanks again for your help.
>> 
>> Regards,
>> Guilherme
>> 
>> On Mon, May 16, 2016 at 12:11 PM, Llolsten Kaonga <l...@soft-forge.com> wrote:
>> Hello Guilherme,
>> 
>> This may be off, but try running your mpirun command with the option 
>> “--tag-output”. If you see a “broken pipe”, then your issue may be firewall 
>> related. You could then check the thread “Re: [OMPI users] mpirun command 
>> won't run unless the firewalld daemon is disabled” for how to get around 
>> this from Gilles or Jeff.
>> 
>> I thank you.
>> 
>> --
>> Llolsten
>> 
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Zabiziz Zaz
>> Sent: Monday, May 16, 2016 10:46 AM
>> To: us...@open-mpi.org
>> Subject: [OMPI users] ORTE has lost communication
>> 
>> Hi,
>> 
>> I'm using openmpi-1.10.2 and sometimes I'm receiving the message below:
>> 
>> --------------------------------------------------------------------------
>> ORTE has lost communication with its daemon located on node:
>> 
>>   hostname:  xxxx
>> 
>> This is usually due to either a failure of the TCP network
>> connection to the node, or possibly an internal failure of
>> the daemon itself. We cannot recover from this failure, and
>> therefore will terminate the job.
>> --------------------------------------------------------------------------
>> 
>> My applications are fault tolerant and the jobs usually take weeks to 
>> finish. Sometimes a hardware problem occurs on one node, for example, the 
>> node shuts down. I don't want MPI to terminate the job; my jobs usually run 
>> on hundreds of nodes and I don't care if one node loses communication.
>> 
>> Is it possible to change this behavior of Open MPI? I tried setting 
>> orte_abort_on_non_zero_status to 0, but it didn't work.
>> 
>> Thanks for your help.
>> 
>> Regards,
>> Guilherme.
>> 
>> 
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org <mailto:us...@open-mpi.org>
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users 
>> <https://www.open-mpi.org/mailman/listinfo.cgi/users>
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2016/05/29214.php 
>> <http://www.open-mpi.org/community/lists/users/2016/05/29214.php>
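
Llolsten's suggestion in the thread (run mpirun with --tag-output and look for "broken pipe") could be automated along the following lines. This is only a rough sketch under stated assumptions: the function names build_command and find_broken_pipes, the application name ./my_app, and the process count are hypothetical, and the tagged-line shape "[job,rank]<stream>:text" is assumed rather than taken from this thread. Lines that do not match that shape are still reported, just without a rank.

```python
import re

# Assumed shape of a line tagged by --tag-output, e.g. "[1,0]<stderr>:some text".
TAG_RE = re.compile(
    r"^\[(?P<job>\d+),(?P<rank>\d+)\]<(?P<stream>\w+)>:(?P<text>.*)$"
)


def build_command(app, nprocs):
    """Return an mpirun argument list with --tag-output enabled.

    `app` and `nprocs` are placeholders for your own job.
    """
    return ["mpirun", "--tag-output", "-np", str(nprocs), app]


def find_broken_pipes(output):
    """Return (rank, stream, text) for each line mentioning 'broken pipe'.

    rank and stream are None when the line carries no recognizable tag.
    """
    hits = []
    for line in output.splitlines():
        stripped = line.strip()
        if "broken pipe" not in stripped.lower():
            continue
        match = TAG_RE.match(stripped)
        if match:
            hits.append(
                (int(match["rank"]), match["stream"], match["text"].strip())
            )
        else:
            hits.append((None, None, stripped))
    return hits


if __name__ == "__main__":
    # Made-up sample output, used only to exercise the parser.
    sample = "\n".join([
        "[1,0]<stdout>:step 10 done",
        "[1,3]<stderr>:write failed: Broken pipe",
    ])
    print(" ".join(build_command("./my_app", 4)))
    for hit in find_broken_pipes(sample):
        print(hit)
```

In practice the argument list from build_command would be passed to subprocess.run with captured output, and the captured text handed to find_broken_pipes. As noted in the thread, a node reboot is a different failure from a firewall problem, so an empty result here does not rule out the ORTE daemon loss described in the original post.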
>> 
