Glenn,

If the error message is about "privileged" memory, i.e. locked or
pinned memory, on Solaris you can increase the amount of available
privileged memory by editing the /etc/project file on the nodes.

To check the amount currently available (a typical value is around 900MB):
% prctl -n project.max-device-locked-memory -i project default

Edit /etc/project:
Default line of interest :
   default:3::::

Change to, for example, 4GB:
   default:3::::project.max-device-locked-memory=(priv,4197152000,deny)

What should ompi_free_list_max be set to?  By default each connection
will post 8 receives, 7 sends, 32 RDMA writes, and possibly a few
internal control messages. Since these all pull from the same free
list, I believe a sufficient value could be calculated as:
50 * (np - 1). Memory will still be consumed, but this should lessen
the amount of privileged memory required.

Memory consumption is something Sun is actively investigating. What
size job are you running?

Not sure if this is part of the issue, but another possibility: if the
communication pattern of the MPI job is actually starving one
connection out of memory, you could try setting "--mca
mpi_preconnect_all 1" and "--mca btl_udapl_max_eager_rdma_peers X",
where X is equal to np. This will establish a connection between
all processes in the job as well as create a channel for short
messages to use RDMA functionality. By establishing this channel
to all connections before the MPI job starts up, each peer connection
will be guaranteed some amount of privileged memory over which it can
potentially communicate. Of course, you do take the hit of wireup
time for all connections at MPI_Init.
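A sketch of what that invocation could look like (the process count
and application name are placeholders; the command is printed rather
than run):

```shell
#!/bin/sh
NP=16   # placeholder job size

# Preconnect every peer pair and give each connection an eager-RDMA
# channel, so each one is guaranteed some privileged memory at MPI_Init.
MCA_ARGS="--mca mpi_preconnect_all 1 --mca btl_udapl_max_eager_rdma_peers $NP"
echo "mpirun -np $NP $MCA_ARGS ./my_app"
```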

-DON

Brian Barrett wrote:

On Aug 2, 2007, at 4:22 PM, Glenn Carver wrote:

Hopefully an easy question to answer... is it possible to get at the
values of mca parameters whilst a program is running? What I had in
mind was either an open-mpi function to call which would print the
current values of mca parameters or a function to call for specific
mca parameters. I don't want to interrupt the running of the
application.

Bit of background. I have a large F90 application running with
OpenMPI (as Sun Clustertools 7) on Opteron CPUs with an IB network.
We're seeing swap thrashing occurring on some of the nodes at times
and having searched the archives and read the FAQ believe we may be
seeing the problem described in:
http://www.open-mpi.org/community/lists/users/2007/01/2511.php
where the udapl free list is growing to a point where lockable memory runs out.

Problem is, I have no feel for the kinds of numbers that
"btl_udapl_free_list_max" might safely get up to. Hence the request
to print mca parameter values whilst the program is running to see if
we can tie in high values of this parameter to when we're seeing swap
thrashing.

Good news, the answer is easy. Bad news is, it's not the one you
want. btl_udapl_free_list_max is the *greatest* the list will ever be
allowed to grow to, not its current size. So if you don't specify a
value and use the default of -1, it will return -1 for the life of
the application, regardless of how big those free lists actually get.
If you specify value X, it'll return X for the life of the
application as well.

There is not a good way for a user to find out the current size of a free list or the largest it got for the life of an application (currently those two will always be the same, but that's another story). Your best bet is to set the parameter to some value (say, 128 or 256) and see if that helps with the swapping.
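A sketch of that experiment (the process count, cap value, and
application name are placeholders):

```shell
#!/bin/sh
# Cap the uDAPL free list per Brian's suggestion; try 128 first and
# bump to 256 if performance suffers. (-np 32 and ./my_app are placeholders.)
CAP=128
echo "mpirun -np 32 --mca btl_udapl_free_list_max $CAP ./my_app"
```

If the installation supports it, something like "ompi_info --param btl
udapl" should list the configured (though not runtime) values of these
parameters, which is a way to confirm the setting took effect.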


Brian
