On Aug 18, 2009, at 10:59 AM, Oskar Enoksson wrote:
The question is, however, why is cl120 not acking messages? What
is the application? What MPI calls does this application use?
Scott
The reason in this case was that cl120 had some kind of hardware
problem, perhaps a memory error or a Myrinet NIC hardware error. The
system hung.
I will try MX_ZOMBIE_SEND=0, thanks for the hint!
But I'm still curious: is there no way to set some kind of timeout
on the waiting hosts? E.g. one hour?
There is a send timeout in MX. There is no receive timeout in MPI or MX.
The application could add pending receives with a timestamp to a
pending queue and walk the queue periodically. If it finds a receive
that has exceeded the application's threshold, it could call
MPI_Cancel().
Scott