Ben,

You may try disabling the registration cache; it may relieve pressure on memory resources:

    --mca mpi_leave_pinned 0
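For example, reusing the launch line from your message (numproc and ./app are your own placeholders), the parameter can go on the mpirun command line:

    mpirun --mca mpi_leave_pinned 0 --bind-to-core -np numproc ./app

or be set as an environment variable, in the same style as your receive_queues setting:

    setenv OMPI_MCA_mpi_leave_pinned 0
    mpirun --bind-to-core -np numproc ./app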
You can find a bit more detail here: http://www.open-mpi.org/faq/?category=openfabrics#large-message-leave-pinned
With this option you may observe a drop in bandwidth performance.

Regards,
Pavel (Pasha) Shamis
---
Computer Science Research Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Jul 5, 2013, at 3:33 PM, Ben <benjamin.m.a...@nasa.gov> wrote:

> I'm part of a team that maintains a global climate model running under
> MPI. Recently we have been trying it out with different MPI stacks
> at high resolution/processor counts.
>
> At one point in the code there is a large number of MPI_Isend/MPI_Recv
> calls (tens to hundreds of thousands) when data distributed across all
> MPI processes must be collected on a particular processor or processors
> to be transformed to a new resolution before writing. At first the model
> was crashing with the message:
>
> "A process failed to create a queue pair. This usually means either the
> device has run out of queue pairs (too many connections) or there are
> insufficient resources available to allocate a queue pair (out of
> memory). The latter can happen if either 1) insufficient memory is
> available, or 2) no more physical memory can be registered with the device."
>
> when it hit the part of the code with the sends/receives. Watching the
> node memory in an xterm I could see the memory skyrocket and fill the node.
>
> Somewhere we found a suggestion to try using the XRC queues
> (http://www.open-mpi.org/faq/?category=openfabrics#ib-xrc) to get around
> this problem, and indeed running with
>
> setenv OMPI_MCA_btl_openib_receive_queues
> "X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32"
> mpirun --bind-to-core -np numproc ./app
>
> allowed the model to run successfully. It still seems to use a large
> amount of memory when it writes (on the order of several GB). Does
> anyone have any suggestions on how to tweak the settings to help with
> memory use?
>
> --
> Ben Auer, PhD    SSAI, Scientific Programmer/Analyst
> NASA GSFC, Global Modeling and Assimilation Office
> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD 20771
> Phone: 301-286-9176    Fax: 301-614-6246
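As a side note, here is a minimal sketch of the kind of collection pattern described above (rank 0 stands in for the writer process; the buffer names, slab size, and datatype are illustrative, not taken from the actual model). Each non-root rank posts an MPI_Isend to the writer, which receives from every peer in turn; on the openib BTL each sender/receiver pairing opens a connection, and therefore consumes queue pairs, which is why the "too many connections" error shows up at high process counts:

    #include <mpi.h>
    #include <stdlib.h>

    /* Illustrative sketch only: gather distributed slabs onto rank 0
     * with point-to-point MPI_Isend/MPI_Recv, as described above. */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int N = 100000;  /* illustrative slab size per rank */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *slab = malloc(N * sizeof(double));

        if (rank == 0) {
            /* The writer receives one slab from every other rank,
             * opening a connection (and queue pairs) per peer. */
            double *global = malloc((size_t)N * nprocs * sizeof(double));
            for (int src = 1; src < nprocs; src++)
                MPI_Recv(global + (size_t)src * N, N, MPI_DOUBLE, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ... transform to the output resolution and write ... */
            free(global);
        } else {
            /* Every other rank sends its slab to the writer. */
            MPI_Request req;
            MPI_Isend(slab, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }

        free(slab);
        MPI_Finalize();
        return 0;
    }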