Hi Oskar,

Oskar Enoksson wrote:
> The reason in this case was that cl120 had some kind of hardware problem, perhaps memory error or myrinet NIC hardware error. The system hung.
>
> I will try MX_ZOMBIE_SEND=0, thanks for the hint!

I would not recommend using that setting. It will hurt performance, exercise a code path that is less tested, and not really address the problem.

Because small messages are buffered in MX, a send can return immediately: the send buffer can be reused right away. However, if the MX library fails to deliver the message reliably, it will eventually call the asynchronous error handler to report the problem. The default asynchronous error handler only prints a message and leaves the choice of recovery to the application. The right way to address the problem would be for Open MPI to register its own asynchronous error handler in the MX BTL/MTL and signal ORTE to terminate the job when a send timeout occurs.
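
To make the pattern concrete, here is a rough sketch in C of a buffered send that returns at once, a default asynchronous handler that only prints, and an application handler that requests job termination on a send timeout. This is not the MX or Open MPI code: every name in it (sketch_send_small, sketch_set_async_handler, job_abort_on_timeout, and so on) is hypothetical, and the library side is simulated so the example stands alone.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical status codes for this sketch; not the real MX API. */
typedef enum {
    SKETCH_OK = 0,
    SKETCH_TOO_LARGE = 1,
    SKETCH_SEND_TIMEOUT = 2
} sketch_status_t;

/* Signature of an asynchronous error handler (assumed for illustration). */
typedef void (*sketch_async_handler_t)(sketch_status_t status, int peer,
                                       void *ctx);

/* Default handler: only prints, leaving recovery to the application. */
static void sketch_default_handler(sketch_status_t status, int peer, void *ctx)
{
    (void)ctx;
    fprintf(stderr, "sketch: async error %d for peer %d\n", (int)status, peer);
}

static sketch_async_handler_t current_handler = sketch_default_handler;
static void *current_ctx = NULL;
static char staging[128]; /* stand-in for the library's internal buffer */

/* Install an application handler; passing NULL restores the default. */
void sketch_set_async_handler(sketch_async_handler_t handler, void *ctx)
{
    current_handler = handler ? handler : sketch_default_handler;
    current_ctx = handler ? ctx : NULL;
}

/* Small send: the payload is copied, so the call returns immediately and
 * the caller may reuse buf. SKETCH_OK here does not mean "delivered". */
sketch_status_t sketch_send_small(int peer, const void *buf, size_t len)
{
    (void)peer;
    if (len > sizeof staging)
        return SKETCH_TOO_LARGE;
    memcpy(staging, buf, len);
    return SKETCH_OK;
}

/* Simulates the library noticing, later on, that delivery failed. */
void sketch_report_delivery_failure(int peer)
{
    current_handler(SKETCH_SEND_TIMEOUT, peer, current_ctx);
}

/* Application side: state that a runtime would inspect to kill the job. */
typedef struct {
    int abort_requested;
    int failed_peer;
} job_state_t;

/* Application handler: on a send timeout, record that the job must end. */
void job_abort_on_timeout(sketch_status_t status, int peer, void *ctx)
{
    job_state_t *job = ctx;
    if (status == SKETCH_SEND_TIMEOUT) {
        job->abort_requested = 1;
        job->failed_peer = peer;
    }
}
```

An application would call sketch_set_async_handler(job_abort_on_timeout, &job) once at startup. The point of the sketch is that the send itself reports success, so the failure can only surface through the handler; in a real runtime the handler would notify the job launcher rather than set a flag.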

We will implement this mechanism and push it to the trunk shortly.

Thanks

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
