My application has a heartbeat that checks whether a node is alive, and the master can redistribute a task to another node if it loses communication with that node. The application also has checkpoint/restart, but since a job usually spans hundreds of nodes and takes a long time to restart, in this case I would prefer to keep the job running rather than terminate it. Is this possible?
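To illustrate, here is a rough Python sketch of the master-side heartbeat logic I described — the names (`Master`, `reap_dead_nodes`, the 30-second timeout) are only illustrative, not our actual code: the master records the last heartbeat time per node and hands a silent node's tasks to the least-loaded live node instead of aborting.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds without a heartbeat before a node is presumed dead


class Master:
    def __init__(self, nodes):
        # last heartbeat timestamp and assigned task list per node
        self.last_seen = {n: time.monotonic() for n in nodes}
        self.tasks = {n: [] for n in nodes}

    def on_heartbeat(self, node):
        # called whenever a heartbeat message arrives from a worker
        self.last_seen[node] = time.monotonic()

    def reap_dead_nodes(self, now=None):
        """Drop nodes that have gone silent and move their tasks to a live node."""
        now = time.monotonic() if now is None else now
        dead = [n for n, t in self.last_seen.items() if now - t > HEARTBEAT_TIMEOUT]
        for n in dead:
            del self.last_seen[n]
            orphaned = self.tasks.pop(n)
            if self.tasks:  # redistribute only if any node survives
                target = min(self.tasks, key=lambda m: len(self.tasks[m]))
                self.tasks[target].extend(orphaned)
        return dead
```

The point is that the application can survive a lost node at the task level; the problem is that mpirun kills the whole job before this logic gets a chance to run.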
Regards,
Guilherme.

On Mon, May 16, 2016 at 12:28 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> What do you mean by a fault-tolerant application?
> From an Open MPI point of view, if such a connection is lost, your
> application will no longer be able to communicate, so killing it is the
> best option.
> If your application has built-in checkpoint/restart, then you have to
> restart it with mpirun after the first mpirun aborts and your environment
> is fixed. Or your batch manager should restart/resubmit the job, possibly
> on a different set of nodes.
>
> Makes sense?
>
> Cheers,
>
> Gilles
>
> On Monday, May 16, 2016, Zabiziz Zaz <zabi...@gmail.com> wrote:
>
>> Hi,
>> I'm using openmpi-1.10.2 and sometimes I receive the message below:
>> --------------------------------------------------------------------------
>> ORTE has lost communication with its daemon located on node:
>>
>>   hostname: xxxx
>>
>> This is usually due to either a failure of the TCP network
>> connection to the node, or possibly an internal failure of
>> the daemon itself. We cannot recover from this failure, and
>> therefore will terminate the job.
>> --------------------------------------------------------------------------
>>
>> My applications are fault tolerant, and the jobs usually take weeks to
>> finish. Sometimes a hardware problem occurs on one node, for example the
>> node shuts down. I don't want MPI to terminate the job; my jobs usually
>> span hundreds of nodes and I don't care if one node loses communication.
>>
>> Is it possible to change this behavior of Open MPI? I tried setting
>> orte_abort_on_non_zero_status to 0, but it didn't work.
>>
>> Thanks for your help.
>>
>> Regards,
>> Guilherme.
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29219.php