Scott Atchley <atch...@myri.com> wrote:

Long answer:

The messages below indicate that these processes were all trying to send to cl120. It did not ack their messages after 1000 resend attempts. Each retry is attempted at a 0.5 second interval, so the timeout is 500 seconds (about 8.3 minutes).
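
As a quick check of that arithmetic, here is a minimal sketch in Python. The constant names are illustrative only and are not MX settings; the values are the ones quoted above.

```python
# Worst-case time before MX gives up on an unresponsive peer,
# using the figures quoted above (names are illustrative, not MX settings).
RESEND_ATTEMPTS = 1000    # resend attempts before the send times out
RETRY_INTERVAL_S = 0.5    # seconds between attempts

timeout_s = RESEND_ATTEMPTS * RETRY_INTERVAL_S
timeout_min = timeout_s / 60

print(f"{timeout_s:.0f} s = {timeout_min:.1f} min")  # prints "500 s = 8.3 min"
```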

The messages also indicate that the message was a send_small, which means it was 128 bytes or less. MX has MPI-like semantics and allows a send to complete once the message has been either buffered or delivered. In this case, it was buffered and OMPI was most likely able to complete it successfully. The message could not be delivered, however, and its timeout caused MX to fail all future sends to that host. On the next mx_isend() to that host, OMPI will detect a failure.

Since OMPI does not detect a failure, my guess is that the processes have not tried to send to that host again, so they end up waiting forever.

They can change MX's behavior so that it does not complete a send until the receiver has acked it by exporting:

MX_ZOMBIE_SEND=0

This will hurt benchmark performance, but real application performance should not be affected.

The question is, however, why is cl120 not acking messages? What is the application? What MPI calls does this application use?

Scott

The reason in this case was that cl120 had some kind of hardware problem, perhaps a memory error or a Myrinet NIC hardware error. The system hung.

I will try MX_ZOMBIE_SEND=0, thanks for the hint!

But I'm still curious: is there no way to set some kind of timeout on the waiting hosts, e.g. one hour?
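
To make the question concrete, the kind of limit meant here could look roughly like the following application-level sketch in Python. It is only an illustration of polling with a deadline, not something MX or Open MPI provides: `wait_with_deadline`, `is_done`, and the polling interval are all hypothetical names and values, and it assumes the application waits on nonblocking operations that it can test for completion.

```python
import time

def wait_with_deadline(is_done, timeout_s, poll_interval_s=0.1):
    """Poll is_done() until it returns True or timeout_s seconds elapse.

    Returns True if the operation completed, False if the deadline passed.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_done():
            return True
        time.sleep(poll_interval_s)
    # One last check so a completion right at the deadline is not missed.
    return is_done()

# Example: an operation that never completes is abandoned after the limit.
completed = wait_with_deadline(lambda: False, timeout_s=0.2, poll_interval_s=0.05)
print("completed" if completed else "timed out")
```

In a real job the limit would be one hour (3600 seconds) rather than 0.2, `is_done` would test the pending request, and on a timeout the process would abort the job instead of printing.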
