On Jun 27, 2012, at 6:32 PM, Martin Siegert wrote:

> However, there is another issue that may affect the performance of the 1.6.1
> version. I see a LOT of the following messages on stderr:
> 
> --------------------------------------------------------------------------
> The OpenFabrics (openib) BTL failed to register memory in the driver.
> Please check /var/log/messages or dmesg for driver specific failure
> reason.
> The failure occured here:
> 
>  Local host:    b413
>  Device:        mlx4_0
>  Function:      openib_reg_mr()
>  Errno says:    Cannot allocate memory (errno=12)
> 
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------

There's been a LOT of discussion about this by the developers (both on-line and 
off).

We've removed that error message, so at least you won't see it ad infinitum.

What's happening is that you're getting a registered memory imbalance -- see 
http://blogs.cisco.com/performance/registered-memory-imbalances/ for some 
details.

The fix we put in handles registered memory exhaustion in most cases by 
falling back to send/recv when a registration fails.  But because of OMPI's 
lazy wire-up, exhaustion can still bite later (e.g., late in an application 
you do an MPI_SEND to a new recipient, and OMPI can't allocate a new QP 
because it's out of registered memory).
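
To make that failure mode concrete, here is a toy model of it in Python. This is only a sketch under stated assumptions, not Open MPI code: the class names, the unit-based "budget", and the fixed per-connection cost are all hypothetical, and registered buffers are never released (a rough stand-in for a registration cache that holds on to them).

```python
class RegisteredMemoryExhausted(Exception):
    """Raised when the toy pool cannot satisfy a registration request."""


class ToyEndpoint:
    """Toy model of one process with a fixed registered-memory budget.

    Hypothetical names and sizes throughout; this does not mirror the
    openib BTL's real data structures.
    """

    def __init__(self, budget, qp_cost):
        self.budget = budget      # total units of memory that can be registered
        self.qp_cost = qp_cost    # units needed to wire up one new connection (QP)
        self.used = 0             # units registered so far (never released here)
        self.connected = set()    # peers wired up so far

    def register(self, units):
        # Fail the way the driver would once the budget is gone.
        if self.used + units > self.budget:
            raise RegisteredMemoryExhausted(units)
        self.used += units

    def send(self, peer, units):
        # Lazy wire-up: the connection is only created on the first send
        # to a given peer, so its cost can land late in the run.
        if peer not in self.connected:
            self.register(self.qp_cost)   # may raise; no fallback at this point
            self.connected.add(peer)
        try:
            self.register(units)          # try to register the user buffer
            return "rdma"
        except RegisteredMemoryExhausted:
            # Fallback path: reuse what is already set up for send/recv.
            return "send/recv"
```

With `ToyEndpoint(budget=10, qp_cost=2)`, a first `send("A", 6)` goes out as "rdma", a second `send("A", 6)` falls back to "send/recv", and a later first `send("C", 1)` to a new peer raises once the budget is fully used, which is the case the fallback doesn't cover.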

It turns out to be a rather sticky problem to solve.  We're still debating.  :-\

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

