Patrick Geoffray <patr...@myri.com> wrote:
> Hi Oskar,
> 
> Oskar Enoksson wrote:
>> The reason in this case was that cl120 had some kind of hardware 
>> problem, perhaps memory error or myrinet NIC hardware error. The system 
>> hung.
>>
>> I will try MX_ZOMBIE_SEND=0, thanks for the hint!
> 
> I would not recommend using that setting. It will hurt performance, 
> use a code path that is less tested, and not really address the problem.
> 
> Because small messages are buffered in MX, a send can return 
> immediately: the send buffer can be reused right away. However, if the 
> MX lib fails to reliably deliver the message, it will eventually call 
> the asynchronous error handler to report the problem. The default async 
> error handler only prints a message, leaving the choice of recovery to 
> the application. The right way to address the problem would be for 
> OpenMPI to register its own asynchronous error handler in the MX 
> BTL/MTL, and signal to ORTE to terminate the job when a send timeout 
> has occurred.
> 
> We will implement this mechanism and push it on the trunk shortly.
> 
> Thanks

Sounds great, I'm looking forward to it. Thanks a lot.
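
For readers following the thread, the control flow Patrick describes can be
sketched roughly as below. This is only a toy model, not MX or Open MPI code:
FakeTransport, Job, set_error_handler, progress and request_abort are all
hypothetical names invented for illustration, and the real MX and ORTE
interfaces may look quite different. It only shows why a buffered send can
"succeed" locally while the failure surfaces later through an asynchronous
handler, and how a custom handler can turn that into a job abort.

```python
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical handler type: receives a human-readable error message.
ErrorHandler = Callable[[str], None]


def default_handler(message: str) -> None:
    """Models the default behaviour: print the error and do nothing else."""
    print(f"async error: {message}")


class FakeTransport:
    """Toy stand-in for a messaging library that buffers small sends."""

    def __init__(self) -> None:
        self._handler: ErrorHandler = default_handler
        self._pending: List[Tuple[str, bytes]] = []

    def set_error_handler(self, handler: ErrorHandler) -> None:
        # Replace the print-only default with an application-chosen handler.
        self._handler = handler

    def send(self, peer: str, payload: bytes) -> None:
        # The payload is copied into an internal buffer, so the call returns
        # immediately and the caller sees no error at this point.
        self._pending.append((peer, bytes(payload)))

    def progress(self, dead_peers: Set[str]) -> None:
        # Later, delivery is attempted; sends to unreachable peers are
        # reported through the asynchronous handler, not through send().
        for peer, _payload in self._pending:
            if peer in dead_peers:
                self._handler(f"send to {peer} timed out")
        self._pending = []


class Job:
    """Toy stand-in for the runtime that can terminate the whole job."""

    def __init__(self) -> None:
        self.abort_reason: Optional[str] = None

    def request_abort(self, reason: str) -> None:
        # Keep the first failure as the reason for terminating the job.
        if self.abort_reason is None:
            self.abort_reason = reason
```

In this model, registering job.request_abort with set_error_handler() plays
the role of the handler that the MX BTL/MTL would install: send() to the hung
node still returns at once, and the abort is only requested when progress()
later detects the timeout.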
