What do you mean by a fault-tolerant application?

From an Open MPI point of view, if such a connection is lost, your application will no longer be able to communicate, so killing it is the best option.

If your application has built-in checkpoint/restart, then you have to restart it with mpirun after the first mpirun aborts and your environment is fixed. Alternatively, your batch manager should restart/resubmit the job, possibly on a different set of nodes.
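To illustrate the restart approach, here is a minimal sketch of a wrapper that reruns a command until it exits cleanly. It is only an illustration under assumptions: the function name `run_until_success` and its parameters are made up for this example, and it does not fix the environment for you (e.g. removing a dead node from the host list), which still has to happen between attempts.

```python
import subprocess
import sys
import time


def run_until_success(cmd, max_attempts=5, delay_seconds=0.0):
    """Run cmd repeatedly until it exits with status 0.

    Hypothetical helper for illustration only. Returns the number of
    attempts used, or raises RuntimeError if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        # mpirun returns a non-zero status when it aborts the job.
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        print(
            f"attempt {attempt} exited with status {result.returncode}",
            file=sys.stderr,
        )
        # Leave time for the environment to be repaired before retrying.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"command failed after {max_attempts} attempts")
```

You would call it with your own command line, for instance `run_until_success(["mpirun", "-np", "256", "./my_app"], max_attempts=10, delay_seconds=60)`, where `./my_app` stands in for an application that resumes from its latest checkpoint on startup.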
Makes sense?

Cheers,

Gilles

On Monday, May 16, 2016, Zabiziz Zaz <zabi...@gmail.com> wrote:

> Hi,
> I'm using openmpi-1.10.2 and sometimes I'm receiving the message below:
> --------------------------------------------------------------------------
> ORTE has lost communication with its daemon located on node:
>
> hostname: xxxx
>
> This is usually due to either a failure of the TCP network
> connection to the node, or possibly an internal failure of
> the daemon itself. We cannot recover from this failure, and
> therefore will terminate the job.
>
> --------------------------------------------------------------------------
>
> My applications are fault tolerant and the jobs usually takes weeks to
> finish. Sometimes a hardware problem occurs with one node, for example, the
> node shutdown. I don't want mpi to terminate the job, my jobs usually have
> hundreds of nodes and I don't care if 1 node lost communication.
>
> It's possible to change this behavior of openmpi? I tried to
> set orte_abort_on_non_zero_status to 0 but it didn't work.
>
> Thanks for your help.
>
> Regards,
> Guilherme.
>