Do you know if there is another patch available so that my application can
handle the failure of one node instead of MPI killing the job? This is very
important for me: I have a big cluster and I can't stop my job every time I
have a problem with just one node.
Regards
On Fri, Sep 23, 2011 at 4:34 PM, Ralph
On Sep 23, 2011, at 1:21 PM, Guilherme V wrote:
> I'm using version 1.4.3 and I forgot to mention that I made a change in
> orterun.c at line 792:
>
> if (ORTE_JOB_STATE_TERMINATED != exit_state) {
> exit(0); /* patch*/
>
I don't see how that change can keep your job
I'm using version 1.4.3 and I forgot to mention that I made a change in
orterun.c at line 792:
if (ORTE_JOB_STATE_TERMINATED != exit_state) {
exit(0); /* patch*/
Regards
> What version of OMPI are you using? The job should terminate in either
case - what did you do to
What version of OMPI are you using? The job should terminate in either case -
what did you do to keep it running after node failure with tcp?
On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:
> Hi,
> I want to know if anybody is having problems with fault-tolerant jobs using
> InfiniBand. When I
Hi,
I want to know if anybody is having problems with fault-tolerant jobs using
InfiniBand. When I run my job over TCP and anything happens to one node, my
job keeps running, but if I switch the job to InfiniBand and anything goes
wrong with InfiniBand (e.g. cable problems), my job fails.
Anybody