Scott Atchley <atch...@myri.com> wrote:
Long answer:
The messages below indicate that these processes were all trying to
send to cl120. It did not ack their messages within 1000 resend
attempts (one attempt every 0.5 seconds), which is 500 seconds, or
about 8.3 minutes.
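As a quick check of that arithmetic, here is a small sketch. The variable names are only illustrative. The 1000 attempts and the 0.5 second interval are the figures quoted above.

```python
# Figures quoted in the message: 1000 resend attempts, 0.5 s apart.
RESEND_ATTEMPTS = 1000
RESEND_INTERVAL_S = 0.5

# Total time before the send is declared timed out.
timeout_s = RESEND_ATTEMPTS * RESEND_INTERVAL_S
timeout_min = timeout_s / 60

print(f"{timeout_s:.0f} s = {timeout_min:.1f} min")
```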
The messages also indicate that the message was a send_small which
means it was 128 bytes or less. MX has MPI-like semantics and allows
a send to complete once the message has been either buffered or
delivered. In this case, it was buffered and OMPI was most likely able
to complete it successfully. The message was not able to be delivered,
however, and its timeout caused MX to fail all future sends to that
host. On the next mx_isend(), OMPI will detect a failure.
Since OMPI does not detect a failure, my guess is that the processes
have not tried to send to that host again, so they end up waiting forever.
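The sequence described above can be sketched with a toy model. This is not the MX API: the class, its methods, and the error type are hypothetical and only mimic the described behaviour. A small send completes once it is buffered, a resend timeout marks the peer as failed, and the failure only surfaces on the next send to that peer.

```python
# Toy model only: NOT the MX API. All names here are hypothetical and
# exist just to illustrate the behaviour described in the message.
class ToyEndpoint:
    def __init__(self):
        self.failed_peers = set()   # peers whose earlier send timed out
        self.in_flight = []         # buffered sends not yet acked

    def isend_small(self, peer, payload):
        # A later send to a peer already marked as failed reports the error.
        if peer in self.failed_peers:
            raise RuntimeError(f"send to {peer} failed: peer timed out")
        # Otherwise the small message is buffered and completes at once,
        # whether or not the peer ever acks it.
        self.in_flight.append((peer, payload))
        return "completed"

    def resend_timeout(self, peer):
        # Resend attempts exhausted: drop the peer's buffered messages
        # and fail all future sends to it.
        self.in_flight = [(p, d) for p, d in self.in_flight if p != peer]
        self.failed_peers.add(peer)


ep = ToyEndpoint()
status = ep.isend_small("cl120", b"x" * 64)   # completes from the sender's view
ep.resend_timeout("cl120")                    # cl120 never acked

# Nothing is reported until the sender tries to send again.
try:
    ep.isend_small("cl120", b"y" * 64)
    detected = False
except RuntimeError:
    detected = True
```

In this model, a process that never calls isend_small() for cl120 again never sees the error. That matches the guess above about why the processes keep waiting.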
They can change MX's behavior so that it does not complete a send
until the receiver has acked it by exporting:
MX_ZOMBIE_SEND=0
This will hurt benchmark performance, but real application performance
should not be affected.
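A minimal sketch of one way to apply that setting from a launcher script, assuming the job is started with Open MPI's mpirun, whose -x option forwards a named environment variable to the launched processes. The helper name build_launch and the application path ./my_mpi_app are hypothetical.

```python
import os


def build_launch(app_argv, nprocs):
    """Return (command, environment) for an mpirun launch with MX_ZOMBIE_SEND=0.

    Hypothetical helper: assumes Open MPI's mpirun is on PATH.
    """
    env = dict(os.environ)
    env["MX_ZOMBIE_SEND"] = "0"   # the setting named in the message
    # -x asks mpirun to export the named variable to the launched processes.
    cmd = ["mpirun", "-np", str(nprocs), "-x", "MX_ZOMBIE_SEND"] + list(app_argv)
    return cmd, env


cmd, env = build_launch(["./my_mpi_app"], 4)   # placeholder application name
print(" ".join(cmd))
# To actually launch the job:
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```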
The question is, however, why is cl120 not acking messages? What is
the application? What MPI calls does this application use?
Scott
The reason in this case was that cl120 had some kind of hardware
problem, perhaps a memory error or a Myrinet NIC hardware error. The
system hung.
I will try MX_ZOMBIE_SEND=0, thanks for the hint!
But I'm still curious: is there no way to set some kind of timeout
on the waiting hosts, e.g. one hour?