Hi Oskar,
Oskar Enoksson wrote:
> The reason in this case was that cl120 had some kind of hardware
> problem, perhaps memory error or myrinet NIC hardware error. The system
> hung.
> I will try MX_ZOMBIE_SEND=0, thanks for the hint!
I would not recommend using that setting. It will affect performance,
use a code path that is less tested, and not really address the problem.
Because small messages are buffered in MX, a send can return
immediately: the send buffer can be reused right away. However, if the
MX library fails to reliably deliver the message, it will eventually
call the asynchronous error handler to report the problem. The default
async error handler only prints a message, leaving the choice of
recovery to the application. The right way to address the problem would
be for Open MPI to register its own asynchronous error handler in the MX
BTL/MTL, and signal ORTE to terminate the job when a send timeout has
occurred.
We will implement this mechanism and push it to the trunk shortly.
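
To make the idea concrete, here is a rough sketch in C of the pattern
described above: a library whose default asynchronous handler only
prints, and a transport layer that installs its own handler to request
job termination on a send timeout. Every name in it (the sketch_*
functions, types and the abort flag) is made up for illustration; it
does not use the real MX API or the Open MPI/ORTE code, and the eventual
implementation may look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins: none of these names come from MX or Open MPI. */
typedef enum { SKETCH_SEND_OK = 0, SKETCH_SEND_TIMEOUT = 1 } sketch_status_t;
typedef void (*sketch_async_handler_t)(int peer, sketch_status_t status);

/* Stand-in for "signal the runtime to terminate the job". */
int sketch_job_abort_requested = 0;

/* Default behavior: only print a message, leave recovery to the application. */
void sketch_default_handler(int peer, sketch_status_t status)
{
    fprintf(stderr, "async error: status %d for peer %d\n", (int)status, peer);
}

sketch_async_handler_t sketch_current_handler = sketch_default_handler;

/* Install a new asynchronous handler and return the previous one. */
sketch_async_handler_t sketch_set_async_handler(sketch_async_handler_t handler)
{
    assert(handler != NULL);
    sketch_async_handler_t previous = sketch_current_handler;
    sketch_current_handler = handler;
    return previous;
}

/* Handler a transport layer might register: flag the job for termination
 * when a send timeout is reported. */
void sketch_abort_on_timeout(int peer, sketch_status_t status)
{
    if (status == SKETCH_SEND_TIMEOUT) {
        fprintf(stderr, "send to peer %d timed out, requesting job abort\n", peer);
        sketch_job_abort_requested = 1;
    }
}

/* Simulates the library finding out, after a buffered send has already
 * returned to the caller, that the message could not be delivered. */
void sketch_report_delivery_failure(int peer)
{
    sketch_current_handler(peer, SKETCH_SEND_TIMEOUT);
}
```

With the default handler in place, sketch_report_delivery_failure()
only prints and the job keeps running. After
sketch_set_async_handler(sketch_abort_on_timeout), the same failure sets
the abort flag, which is where a real handler would ask the runtime to
tear the job down.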
Thanks
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com