Re: [OMPI users] How to make a job abort when one host dies?

Oskar Enoksson Tue, 18 Aug 2009 10:59:35 -0400

Scott Atchley <atch...@myri.com> wrote:

Long answer:
The messages below indicate that these processes were all trying to send to cl120. It did not ack their messages after 1000 resend attempts (each retry is attempted with a 0.5 second interval), which is about 8.3 minutes (500 seconds).
The messages also indicate that the message was a send_small, which means it was 128 bytes or less. MX has MPI-like semantics and allows for completion after the message has been either buffered or delivered. In this case, it was buffered and OMPI was most likely able to complete it successfully. The message was not able to be delivered, however, and its timeout caused MX to fail all future sends to that host. On the next mx_isend(), OMPI will detect a failure.
Since it does not detect failure, my guess is that the process has not tried to send again to that host. They then end up waiting forever.
They can change MX's behavior so that it does not complete a send until the receiver has acked it by exporting:
MX_ZOMBIE_SEND=0
This will hurt benchmark performance, but real application performance should not be affected.
The question is, however, why is cl120 not acking messages? What is the application? What MPI calls does this application use?
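As a rough illustration of the export above, here is a minimal Python sketch of a launcher helper that sets the variable and builds a launch command. The helper names `with_zombie_send_disabled` and `mpirun_command` are hypothetical, as are the `./app` program and the process count. It assumes Open MPI's `mpirun`, whose `-x` option forwards an environment variable to the launched processes; whether every rank actually sees the variable depends on the local setup.

```python
import os


def with_zombie_send_disabled(base_env=None):
    """Return a copy of the environment with MX_ZOMBIE_SEND=0 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MX_ZOMBIE_SEND"] = "0"
    return env


def mpirun_command(app_args, nprocs):
    """Build an mpirun command line that forwards MX_ZOMBIE_SEND=0.

    "-x NAME=value" asks Open MPI's mpirun to export the variable to the
    processes it starts, so remote ranks see it as well.
    """
    return ["mpirun", "-np", str(nprocs), "-x", "MX_ZOMBIE_SEND=0", *app_args]


if __name__ == "__main__":
    # Hypothetical application name and rank count, for illustration only.
    print(" ".join(mpirun_command(["./app"], 4)))
```

Setting the variable only in the local shell may not be enough on a cluster, which is why the sketch passes it on the command line as well.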
Scott

The reason in this case was that cl120 had some kind of hardware problem, perhaps a memory error or a Myrinet NIC hardware error. The system hung.


I will try MX_ZOMBIE_SEND=0, thanks for the hint!

But I'm still curious: is there no way to set some kind of timeout or time limit on the waiting hosts? E.g. one hour?
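One generic way to approximate such a limit, independent of MX or Open MPI, would be an external watchdog around the launcher. The sketch below is only that: the function name `run_with_time_limit` and the grace period are hypothetical, and it limits the wall-clock time of the whole job rather than the time a single host spends waiting. It also assumes the launcher cleans up its remote processes when it receives SIGTERM.

```python
import subprocess
import sys


def run_with_time_limit(cmd, limit_seconds, grace_seconds=10.0):
    """Run cmd; return its exit code, or None if it hit the time limit."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=limit_seconds)
    except subprocess.TimeoutExpired:
        # Ask politely first so the launcher can tear down its ranks.
        proc.terminate()
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            # The launcher ignored SIGTERM; force it down.
            proc.kill()
            proc.wait()
        return None


if __name__ == "__main__":
    # Example: python watchdog.py mpirun -np 4 ./app   (one-hour limit)
    code = run_with_time_limit(sys.argv[1:], 3600.0)
    sys.exit(1 if code is None else code)
```

A job that merely runs long is killed as well, so the limit has to be chosen above the expected run time.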
