On Aug 18, 2009, at 10:59 AM, Oskar Enoksson wrote:

The question is, however, why is cl120 not acking messages? What is the application? What MPI calls does this application use?

Scott

The reason in this case was that cl120 had some kind of hardware problem, perhaps a memory error or a Myrinet NIC hardware error. The system hung.

I will try MX_ZOMBIE_SEND=0, thanks for the hint!

But I'm still curious: is there no way to set some kind of timeout on the waiting hosts, e.g. one hour?

There is a send timeout in MX. There is no receive timeout in MPI or MX.

The application could add pending receives with a timestamp to a pending queue and walk the queue periodically. If it finds a receive that has exceeded the application's threshold, it could call MPI_Cancel().

Scott
