Re: [OMPI users] Fault Tolerant with openib

2011-09-27 Thread Guilherme V
Do you know if there is another patch available so that my application can handle the failure of one node instead of MPI killing the job? This is very important for me: I have a big cluster and I can't stop my job every time there is a problem with just one node. Regards On Fri, Sep 23, 2011 at 4:34 PM, Ralph

Re: [OMPI users] Fault Tolerant with openib

2011-09-23 Thread Ralph Castain
On Sep 23, 2011, at 1:21 PM, Guilherme V wrote:
> I'm using version 1.4.3 and I forgot to mention that I have made a change in
> orterun.c, line 792:
>
> if (ORTE_JOB_STATE_TERMINATED != exit_state) {
>     exit(0); /* patch */
>
I don't see how that change can keep your job

Re: [OMPI users] Fault Tolerant with openib

2011-09-23 Thread Guilherme V
I'm using version 1.4.3 and I forgot to mention that I have made a change in orterun.c, line 792:
    if (ORTE_JOB_STATE_TERMINATED != exit_state) {
        exit(0); /* patch */
Regards
> What version of OMPI are you using? The job should terminate in either case - what did you do to

Re: [OMPI users] Fault Tolerant with openib

2011-09-23 Thread Ralph Castain
What version of OMPI are you using? The job should terminate in either case - what did you do to keep it running after node failure with TCP?
On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:
> Hi,
> I want to know if anybody is having problems with fault-tolerant jobs using
> InfiniBand. When I

[OMPI users] Fault Tolerant with openib

2011-09-23 Thread Guilherme V
Hi, I want to know if anybody is having problems with fault-tolerant jobs using InfiniBand. When I run my job over TCP and anything happens to one node, my job keeps running, but if I switch the job to InfiniBand and anything happens with InfiniBand (e.g., cable problems), my job fails. Anybody
