Do you know if there is another patch available so that my application can
handle the failure of one node instead of MPI killing the job? This is very
important for me: I have a big cluster and I can't stop my job every time there
is a problem with just one node.
Regards
On Fri, Sep 23, 2011 at 4:34 PM, Ralph
On Sep 23, 2011, at 1:21 PM, Guilherme V wrote:
> I'm using version 1.4.3 and I forgot to mention that I have made a change in
> orterun.c at line 792:
>
> if (ORTE_JOB_STATE_TERMINATED != exit_state) {
> exit(0); /* patch*/
>
I don't see how that change can keep your job
I'm using version 1.4.3 and I forgot to mention that I have made a change in
orterun.c at line 792:
if (ORTE_JOB_STATE_TERMINATED != exit_state) {
    exit(0); /* patch */
Regards
> What version of OMPI are you using? The job should terminate in either
case - what did you do to
What version of OMPI are you using? The job should terminate in either case -
what did you do to keep it running after node failure with TCP?
On Sep 23, 2011, at 12:34 PM, Guilherme V wrote:
> Hi,
> I want to know if anybody is having problems with fault-tolerant jobs using
> InfiniBand. When I