My application has a heartbeat that checks whether a node is alive, and the
master can redistribute a task to another node if it loses communication
with that node. The application also has checkpoint/restart, but since I
usually run one job across hundreds of nodes and restarting the job takes a
long time, in this case I would prefer to carry on with the job rather than
terminate it. Is that possible?
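
To illustrate the pattern I mean, here is a minimal sketch of the
master-side heartbeat bookkeeping (not my actual code; the tag, the
timeout, the demo runtime and the reassign_task placeholder are made up
for the example). With stock Open MPI a dead node still takes the whole
job down, which is exactly my problem, so the sketch only models the
application-level logic:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define HEARTBEAT_TAG      77
#define HEARTBEAT_TIMEOUT  10.0  /* seconds of silence before a worker is presumed dead */
#define DEMO_RUNTIME       30.0  /* bounded loop so the example terminates */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                  /* master */
        double *last_seen = malloc(size * sizeof *last_seen);
        int    *alive     = malloc(size * sizeof *alive);
        double  start     = MPI_Wtime();
        for (int w = 1; w < size; w++) { last_seen[w] = start; alive[w] = 1; }

        while (MPI_Wtime() - start < DEMO_RUNTIME) {
            int flag = 0;
            MPI_Status st;
            /* drain any pending heartbeats without blocking */
            MPI_Iprobe(MPI_ANY_SOURCE, HEARTBEAT_TAG, MPI_COMM_WORLD, &flag, &st);
            while (flag) {
                int dummy;
                MPI_Recv(&dummy, 1, MPI_INT, st.MPI_SOURCE, HEARTBEAT_TAG,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                last_seen[st.MPI_SOURCE] = MPI_Wtime();
                MPI_Iprobe(MPI_ANY_SOURCE, HEARTBEAT_TAG, MPI_COMM_WORLD, &flag, &st);
            }
            /* time out silent workers and hand their work to someone else */
            for (int w = 1; w < size; w++) {
                if (alive[w] && MPI_Wtime() - last_seen[w] > HEARTBEAT_TIMEOUT) {
                    alive[w] = 0;
                    printf("worker %d presumed dead, re-queueing its task\n", w);
                    /* reassign_task(w) would be called here in the real code */
                }
            }
            sleep(1);  /* the real master would dispatch work here instead */
        }
        free(last_seen);
        free(alive);
    } else {                                          /* worker */
        double start = MPI_Wtime();
        /* heartbeat for the first third of the demo, then go silent to
           simulate a failed node; the master should notice the silence */
        while (MPI_Wtime() - start < DEMO_RUNTIME / 3.0) {
            int dummy = 0;
            MPI_Send(&dummy, 1, MPI_INT, 0, HEARTBEAT_TAG, MPI_COMM_WORLD);
            sleep(1);
        }
    }

    MPI_Finalize();
    return 0;
}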

Regards,
Guilherme.

On Mon, May 16, 2016 at 12:28 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> What do you mean by a fault-tolerant application?
> From an Open MPI point of view, if such a connection is lost, your
> application will no longer be able to communicate, so killing it is the
> best option.
> If your application has built-in checkpoint/restart, then you have to
> restart it with mpirun after the first mpirun aborts and your environment
> is fixed.
> Alternatively, your batch manager should restart/resubmit the job, possibly
> on a different set of nodes.
>
> Does that make sense?
>
> Cheers,
>
> Gilles
>
> On Monday, May 16, 2016, Zabiziz Zaz <zabi...@gmail.com> wrote:
>
>> Hi,
>> I'm using openmpi-1.10.2 and sometimes I receive the message below:
>> --------------------------------------------------------------------------
>> ORTE has lost communication with its daemon located on node:
>>
>>   hostname:  xxxx
>>
>> This is usually due to either a failure of the TCP network
>> connection to the node, or possibly an internal failure of
>> the daemon itself. We cannot recover from this failure, and
>> therefore will terminate the job.
>>
>> --------------------------------------------------------------------------
>>
>> My applications are fault tolerant and the jobs usually take weeks to
>> finish. Sometimes a hardware problem occurs on one node; for example, the
>> node shuts down. I don't want MPI to terminate the job: my jobs usually
>> run on hundreds of nodes, and I don't care if one node loses communication.
>>
>> Is it possible to change this behavior of Open MPI? I tried setting
>> orte_abort_on_non_zero_status to 0, but it didn't work.
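>>
>> For reference, this is how I set it on the mpirun command line (the
>> process count and application name here are just placeholders):
>>
>>   mpirun --mca orte_abort_on_non_zero_status 0 -np 256 ./my_app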
>>
>> Thanks for your help.
>>
>> Regards,
>> Guilherme.
>>
>