Ok.
Could you please tell me the latest version that supported this model?

Regards,
Guilherme.

On Mon, May 16, 2016 at 12:30 PM, Ralph Castain <r...@open-mpi.org> wrote:

> We used to do so, but don’t currently support that model - folks are
> working on restoring it. No timetable, though I don’t think it will be too
> much longer before it is in master. Can’t say when it will hit a release.
>
> On May 16, 2016, at 8:25 AM, Zabiziz Zaz <zabi...@gmail.com> wrote:
>
> Hi Llolsten,
> the problem is not a firewall issue. The simplest way to reproduce the
> problem is to reboot a node in the middle of the job. Is it possible to
> configure Open MPI so that it does not terminate the job if one node is
> rebooted in the middle of the job?
>
> Thanks again for your help.
>
> Regards,
> Guilherme
>
> On Mon, May 16, 2016 at 12:11 PM, Llolsten Kaonga <l...@soft-forge.com>
> wrote:
>
>> Hello Guilherme,
>>
>>
>>
>> This may be off, but try running your mpirun command with the option
>> “--tag-output”.
>> If you see a “broken pipe”, then your issue may be firewall related. You
>> could then check the thread “*Re: [OMPI users] mpirun command won't run
>> unless the firewalld daemon is disabled*” for how to get around this
>> from Gilles or Jeff.
>>
>>
>>
>> I thank you.
>>
>> --
>>
>> Llolsten
>>
>>
>>
>> *From:* users [mailto:users-boun...@open-mpi.org] *On Behalf Of *Zabiziz
>> Zaz
>> *Sent:* Monday, May 16, 2016 10:46 AM
>> *To:* us...@open-mpi.org
>> *Subject:* [OMPI users] ORTE has lost communication
>>
>>
>>
>> Hi,
>>
>> I'm using openmpi-1.10.2 and sometimes I'm receiving the message below:
>>
>> --------------------------------------------------------------------------
>> ORTE has lost communication with its daemon located on node:
>>
>>   hostname:  xxxx
>>
>> This is usually due to either a failure of the TCP network
>> connection to the node, or possibly an internal failure of
>> the daemon itself. We cannot recover from this failure, and
>> therefore will terminate the job.
>> --------------------------------------------------------------------------
>>
>>
>>
>> My applications are fault tolerant, and the jobs usually take weeks to
>> finish. Sometimes a hardware problem occurs on one node, for example, the
>> node shuts down. I don't want MPI to terminate the job: my jobs usually run
>> on hundreds of nodes, and I don't care if one node loses communication.
>>
>>
>>
>> Is it possible to change this behavior of Open MPI? I tried
>> setting orte_abort_on_non_zero_status to 0, but it didn't work.
>>
>>
>>
>> Thanks for your help.
>>
>>
>>
>> Regards,
>>
>> Guilherme.
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/05/29214.php
>>
>