I'm using version 1.4.3, and I forgot to mention that I made a change at line 792 of orterun.c:
    if (ORTE_JOB_STATE_TERMINATED != exit_state) {
        exit(0); /* patch */

Regards

> What version of OMPI are you using? The job should terminate in either case - what did you do to keep it running after node failure with tcp?
>
> On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:
>
>> Hi,
>> I want to know if anybody is having problems with fault-tolerant jobs using InfiniBand. When I run my job over TCP and something happens to one node, the job keeps running, but if I switch the job to InfiniBand and something happens to the InfiniBand link (e.g. a cable problem), the job fails.
>>
>> Does anybody know whether something different needs to be done when using openib instead of tcp?
>>
>> Below is an example of the message I'm receiving from MPI.
>>
>> Regards,
>> Guilherme