What do you mean by a fault-tolerant application?
From an Open MPI point of view, if such a connection is lost, your
application can no longer communicate, so killing it is the best
option.
If your application has built-in checkpoint/restart, then you have to
restart it with mpirun after the first mpirun aborts and your environment
is fixed.
Alternatively, your batch manager should restart or resubmit the job,
possibly on a different set of nodes.
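
To illustrate the restart option, here is a minimal sketch of a wrapper that
re-runs a command until it exits cleanly. It is not part of Open MPI; the
function name run_until_success and its parameters are made up for this
example, and it assumes your application resumes from its latest checkpoint
by itself when it is started again.

```python
import subprocess
import sys
import time


def run_until_success(cmd, max_attempts=5, delay_s=0.0):
    """Run cmd repeatedly until it exits with status 0.

    Hypothetical helper for illustration only. Returns the number of
    attempts that were needed, or raises RuntimeError if every attempt
    exits with a non-zero status.
    """
    for attempt in range(1, max_attempts + 1):
        # subprocess.call blocks until the command exits and returns its status.
        status = subprocess.call(cmd)
        if status == 0:
            return attempt
        print(f"attempt {attempt} exited with status {status}", file=sys.stderr)
        if attempt < max_attempts:
            # Leave some time for the environment to be fixed before retrying.
            time.sleep(delay_s)
    raise RuntimeError(f"command failed {max_attempts} times: {cmd}")
```

You would call it with your own command line, for example
run_until_success(["mpirun", "-np", "4", "./my_app"], delay_s=60.0), where
./my_app is a placeholder for your application. A batch manager that
resubmits the job does the same thing, and can also pick a different set of
nodes.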

Does that make sense?

Cheers,

Gilles

On Monday, May 16, 2016, Zabiziz Zaz <zabi...@gmail.com> wrote:

> Hi,
> I'm using openmpi-1.10.2 and sometimes I'm receiving the message below:
> --------------------------------------------------------------------------
> ORTE has lost communication with its daemon located on node:
>
>   hostname:  xxxx
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
>
> --------------------------------------------------------------------------
>
> My applications are fault tolerant and the jobs usually take weeks to
> finish. Sometimes a hardware problem occurs on one node, for example, the
> node shuts down. I don't want MPI to terminate the job; my jobs usually run
> on hundreds of nodes and I don't care if one node loses communication.
>
> Is it possible to change this behavior of Open MPI? I tried to
> set orte_abort_on_non_zero_status to 0, but it didn't work.
>
> Thanks for your help.
>
> Regards,
> Guilherme.
>
