Patrick Geoffray <patr...@myri.com> wrote:
> Hi Oskar,
>
> Oskar Enoksson wrote:
>> The reason in this case was that cl120 had some kind of hardware
>> problem, perhaps memory error or Myrinet NIC hardware error. The system
>> hung.
>>
>> I will try MX_ZOMBIE_SEND=0, thanks for the hint!
>
> I would not recommend using that setting. It will affect performance,
> use a code path that is less tested, and not really address the problem.
>
> As small messages are buffered in MX, a send can return immediately as
> the send buffer can be reused right away. However, if the MX lib fails to
> reliably deliver the message, it will eventually call the asynchronous
> error handler to report the problem. The default async error handler
> only prints a message, leaving the choice of recovery to the
> application. The right way to address the problem would be for OpenMPI to
> register its own asynchronous error handler in the MX BTL/MTL, and
> signal to ORTE to terminate the job when a send timeout has occurred.
>
> We will implement this mechanism and push it on the trunk shortly.
>
> Thanks

Sounds great, I'm looking forward to it. Thanks a lot.
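
For anyone following the thread, here is a rough sketch of the mechanism as I understand it from the description above. It is a toy model in Python, not the MX API and not the planned OpenMPI patch: ToyEndpoint, ToyJob, set_async_error_handler, progress and make_abort_handler are all made-up names, and the "send timeout" is simulated by marking a host as unreachable.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional, Tuple

# Hypothetical handler signature: (destination, reason) -> None
AsyncErrorHandler = Callable[[str, str], None]


def default_handler(dest: str, reason: str) -> None:
    """Models the default behaviour: print a message and carry on."""
    print(f"async error: send to {dest} failed ({reason})")


class ToyEndpoint:
    """Toy stand-in for an endpoint that buffers small sends."""

    def __init__(self, unreachable: Iterable[str] = ()) -> None:
        self._unreachable = set(unreachable)
        self._pending: Deque[Tuple[str, bytes]] = deque()
        self._handler: AsyncErrorHandler = default_handler
        self.delivered: List[Tuple[str, bytes]] = []

    def set_async_error_handler(self, handler: AsyncErrorHandler) -> AsyncErrorHandler:
        """Install a handler and return the previous one."""
        previous, self._handler = self._handler, handler
        return previous

    def send(self, dest: str, payload: bytes) -> bool:
        # The payload is copied into an internal buffer, so the call
        # "completes" at once and the caller may reuse its own buffer.
        self._pending.append((dest, bytes(payload)))
        return True

    def progress(self) -> None:
        # Delivery is attempted later; a failure is only reported here,
        # through the handler, long after send() has returned.
        while self._pending:
            dest, data = self._pending.popleft()
            if dest in self._unreachable:
                self._handler(dest, "send timeout")
            else:
                self.delivered.append((dest, data))


class ToyJob:
    """Toy stand-in for the runtime that can terminate the whole job."""

    def __init__(self) -> None:
        self.aborted = False
        self.abort_reason: Optional[str] = None

    def abort(self, reason: str) -> None:
        if not self.aborted:  # keep the first reason only
            self.aborted = True
            self.abort_reason = reason


def make_abort_handler(job: ToyJob) -> AsyncErrorHandler:
    """Build a handler that asks the runtime to terminate the job."""

    def handler(dest: str, reason: str) -> None:
        job.abort(f"{reason} to {dest}")

    return handler
```

With the default handler, a send to a dead node such as cl120 looks successful and the failure only shows up as a printed message, so the application can keep waiting forever. Installing the handler from make_abort_handler(job) on the endpoint turns the same late failure into a job abort.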