I'm using version 1.4.3, and I forgot to mention that I changed
orterun.c at line 792:

    if (ORTE_JOB_STATE_TERMINATED != exit_state) {
        exit(0); /* patch */

Regards


> What version of OMPI are you using? The job should terminate in either
> case - what did you do to keep it running after node failure with tcp?

>On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:
>> Hi,
>> I want to know if anybody is having problems with fault-tolerant jobs
>> using InfiniBand. When I run my job over tcp and anything happens to one
>> node, my job keeps running, but when I switch the job to InfiniBand and
>> anything happens to the InfiniBand link (e.g. cable problems), my job fails.
>>
>> Does anybody know if something different needs to be done when
>> using openib instead of tcp?
>>
>> Below is an example of the message I'm receiving from MPI.
>>
>> Regards,
>> Guilherme

