On Jun 27, 2012, at 6:32 PM, Martin Siegert wrote:

> However, there is another issue that may affect the performance of the 1.6.1
> version. I see a LOT of the following messages on stderr:
>
> --------------------------------------------------------------------------
> The OpenFabrics (openib) BTL failed to register memory in the driver.
> Please check /var/log/messages or dmesg for driver specific failure
> reason.
> The failure occured here:
>
> Local host: b413
> Device: mlx4_0
> Function: openib_reg_mr()
> Errno says: Cannot allocate memory (errno=12)
>
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------

There's been a LOT of discussion about this by the developers (both on-line and off). We've removed that error message, so at least you won't see it ad infinitum.

What's happening is that you're getting a registered memory imbalance -- see http://blogs.cisco.com/performance/registered-memory-imbalances/ for some details.

The fix we put in solves registered memory exhaustion in most cases (it falls back to send/recv in that case), but due to OMPI's lazy wire up, it can still happen later (e.g., late in an application you do an MPI_SEND to a new recipient, but it can't allocate a new QP because it's out of registered memory).

It turns out to be a rather sticky problem to solve. We're still debating. :-\

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
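
P.S. For illustration only, here is a toy Python model of the sequence described above: connections are wired up lazily on the first send to a peer, a failed registration for the fast path falls back to send/recv, and a late first send to a new peer can still fail because there is no registered memory left for the QP. This is not Open MPI code. The class names, the budget, and the per-QP and per-peer costs are all made-up assumptions chosen to make the sequence visible.

```python
class RegisteredMemoryExhausted(Exception):
    """Raised when the toy pool cannot satisfy a registration request."""


class ToyHCA:
    """Hypothetical stand-in for an adapter with a fixed registered-memory budget."""

    def __init__(self, budget):
        self.free = budget

    def register(self, nbytes):
        if nbytes > self.free:
            raise RegisteredMemoryExhausted(
                f"need {nbytes} units, only {self.free} left"
            )
        self.free -= nbytes


class ToyProcess:
    QP_COST = 4         # assumed registered memory needed per new connection (QP)
    FAST_PATH_COST = 8  # assumed extra registration needed for the fast path

    def __init__(self, hca):
        self.hca = hca
        self.mode = {}  # peer -> transfer mode chosen at first contact

    def send(self, peer):
        # Lazy wire-up: nothing is set up for a peer until the first send to it.
        if peer not in self.mode:
            try:
                self.hca.register(self.QP_COST)
            except RegisteredMemoryExhausted:
                # No fallback is possible here: without a QP there is no connection.
                return "failed: cannot create QP"
            try:
                self.hca.register(self.FAST_PATH_COST)
                self.mode[peer] = "fast path"
            except RegisteredMemoryExhausted:
                # Registration failed, so use plain send/recv for this peer.
                self.mode[peer] = "fallback: send/recv"
        return self.mode[peer]


if __name__ == "__main__":
    proc = ToyProcess(ToyHCA(budget=20))
    for peer in range(4):
        print(f"first send to peer {peer}: {proc.send(peer)}")
```

With the assumed budget of 20 units, the first peer gets the fast path, the next two fall back to send/recv, and the first send to the fourth peer fails outright, which is the "late MPI_SEND to a new recipient" case.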