Do you know if there is another patch available so that my application can handle the failure of one node, instead of MPI killing the job? This is very important to me: I have a big cluster, and I can't stop my job every time there is a problem with just one node.
Regards

On Fri, Sep 23, 2011 at 4:34 PM, Ralph Castain <r...@open-mpi.org> wrote:

> On Sep 23, 2011, at 1:21 PM, Guilherme V wrote:
>
> I'm using version 1.4.3 and I forgot to tell that I have made a change in
> the orterun.c line 792:
>
> if (ORTE_JOB_STATE_TERMINATED != exit_state) {
>     exit(0); /* patch*/
>
>
> I don't see how that change can keep your job running - we should still
> have terminated it. All this does is suppress the error reporting.
>
> Regardless, openib will cause the process to fail under the described
> circumstances, which should cause OMPI to terminate all running procs. I'm
> not sure what you are doing with tcp, but it could be that there are
> alternative paths available - e.g., you have multiple NICs and remove one
> cable, but the other paths remain viable.
>
> Regards
>
>
>
> What version of OMPI are you using? The job should terminate in either
> case - what did you do to keep it running after node failure with tcp?
>
> >On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:
> >> Hi,
> >> I want to know if anybody is having problems with fault tolerant job
> >> using infiniband. When I run my job with tcp if anything happens with one
> >> node, my job keeps running, but if I change my job to use infiniband if
> >> anything happens with the infiniband (i.e cable problems) my job fails.
> >>
> >> Anybody knows if there is something different that need to be done when
> >> using openib instead tcp?
> >>
> >> Bellow a example of the message I'm receiving from the mpi.
> >>
> >> Regards,
> >> Guilherme
> >
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users