Ben,

You may try disabling the registration cache; it may relieve pressure on
memory resources:
--mca mpi_leave_pinned 0
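For example, added to the launch line from your message (the process count
and application name are just placeholders):

mpirun --mca mpi_leave_pinned 0 --bind-to-core -np numproc ./app

or, equivalently, via the environment:

setenv OMPI_MCA_mpi_leave_pinned 0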

You may find a bit more detail here: 
http://www.open-mpi.org/faq/?category=openfabrics#large-message-leave-pinned

Using this option you may observe a drop in bandwidth (BW) performance.
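If you want to keep the XRC receive queues from your message, the two
settings can be combined; a sketch based on your existing setup:

setenv OMPI_MCA_btl_openib_receive_queues "X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32"
setenv OMPI_MCA_mpi_leave_pinned 0
mpirun --bind-to-core -np numproc ./app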

Regards,
Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Jul 5, 2013, at 3:33 PM, Ben <benjamin.m.a...@nasa.gov> wrote:

> I'm part of a team that maintains a global climate model running under
> MPI. Recently we have been trying it out with different MPI stacks
> at high resolution/processor counts.
> At one point in the code there is a large number of MPI_Isend/MPI_Recv
> calls (tens to hundreds of thousands) when data distributed across all
> MPI processes must be collected on a particular processor or processors
> to be transformed to a new resolution before writing. At first the
> model was crashing with the message:
> "A process failed to create a queue pair. This usually means either the 
> device has run out of queue pairs (too many connections) or there are 
> insufficient resources available to allocate a queue pair (out of 
> memory). The latter can happen if either 1) insufficient memory is 
> available, or 2) no more physical memory can be registered with the device."
> when it hit the part of the code with the sends/receives. Watching the
> node memory in an xterm I could see the memory skyrocket and fill the node.
> 
> Somewhere we found a suggestion to try using the XRC queues
> (http://www.open-mpi.org/faq/?category=openfabrics#ib-xrc) to get around
> this problem, and indeed running with
> 
> setenv OMPI_MCA_btl_openib_receive_queues "X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32"
> mpirun --bind-to-core -np numproc ./app
> 
> allowed the model to run successfully. It still seems to use a large
> amount of memory when it writes (on the order of several GB). Does
> anyone have any suggestions on how to tweak the settings to help with
> memory use?
> 
> -- 
> Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
> NASA GSFC,  Global Modeling and Assimilation Office
> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
> Phone: 301-286-9176               Fax: 301-614-6246
> 

