Beware: this is a lengthy, detailed message.
On Jan 18, 2007, at 3:53 PM, Arif Ali wrote:
1. We have
HW
* 2xBladecenter H
* 2xCisco Infiniband Switch Modules
* 1xCisco Infiniband Switch
* 16x PPC64 JS21 blades, each with 4 cores and a Cisco HCA
Can you provide the details of your Cisco HCA?
SW
* SLES 10
* OFED 1.1 w. OpenMPI 1.1.1
I am running the Intel MPI Benchmark (IMB) on the cluster as part
of the validation process for the customer.
I have tried the OpenMPI that comes with OFED 1.1, which gave
spurious "Not Enough Memory" error messages. After looking through
the FAQs (with the help of Cisco), I was able to find the problems
and fixes: I set unlimited soft and hard limits for memlock and
turned RDMA off by using "--mca btl_openib_flags 1". This still did
not work; I still got the memory problems.
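For reference (and for anyone else following along), the memlock
limits mentioned above are typically set in /etc/security/limits.conf
with entries along these lines -- exact details can vary by
distribution, and you generally have to log back in (and restart any
daemons that launch MPI jobs) before the new limits take effect:

   *  soft  memlock  unlimited
   *  hard  memlock  unlimited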
As a clarification: I suggested setting the btl_openib_flags to 1 as
one means of [potentially] reducing the amount of registered memory
to verify that the amount of registered memory available in the
system is the problem (especially because it was dying with large
messages in the all-to-all pattern). With that setting, we got
through the alltoall test (which we previously couldn't). So it
seemed to indicate that on that platform, there isn't much registered
memory available (even though there's 8GB available on each blade).
Are you saying that a full run of the IMB still failed with the same
"cannot register any more memory" kind of error?
I checked with Brad Benton -- an OMPI developer from IBM -- he
confirms that on the JS21s, depending on the version of your
firmware, you will be limited to 256M or 512M of registerable memory
(256M = older firmware, 512M = newer firmware). This could very
definitely be a factor in what is happening here.
Can you let us know what version of the firmware you have?
I tried the nightly snapshot of OpenMPI-1.2b4r13137, which failed
miserably.
Can you describe what happened there? Is it failing in a different way?
I then tried the released version of OpenMPI-1.2b3, which got me
further than before. Now the benchmark goes through all the tests
until Allgatherv finishes, and then it seems to be waiting to start
AlltoAll; I waited about 12 hours to see if it would continue. I
have since managed to run AlltoAll, and the rest of the benchmark,
separately.
If it does not continue within a few minutes, it's not going to go
anywhere. IMB does do "warmup" sends that may take a few minutes,
but if you've gone 5-10 minutes with no activity, it's likely to be
hung.
FWIW: I can run IMB on 64 processes (16 hosts, 4ppn -- but not a
blade center) with no problem. I.e., it doesn't hang/crash.
Hanging instead of crashing may still be a side-effect of running out
of DMA-able memory -- I don't know enough about the IBM hardware to
say. I doubt that we have explored the error scenarios in OMPI too
much; it's pretty safe to say that if limits are not used and the
system runs out of DMA-able memory, Bad / Undefined things may happen
(a "good" scenario would be that the process/MPI job aborts, a "bad"
scenario would be some kind of deadlock situation).
I have tried a few tunable parameters that were suggested by Cisco,
which improved the results, but the benchmark still hung. The
parameters that I used to try to diagnose the problem are below. I
used the debug/verbose variables to see if I could get error
messages while running the benchmark.
#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_flags=1
mpi_leave_pinned=1
mpool_base_use_mem_hooks=1
Note that in that list, only the btl_openib_flags parameter will
[potentially] decrease the amount of registered memory used. Also,
note that mpi_leave_pinned is only useful when utilizing RDMA
operations; so it's effectively a no-op when btl_openib_flags is set
to 1.
--> For those jumping into the conversation late, the value of
btl_openib_flags is a bit mask with the following bits: SEND=1,
PUT=2, GET=4.
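--> And to spell out the arithmetic: the value is just the bitwise OR
of the bits you want enabled. For example, 1 = SEND only (no RDMA,
as used above), 3 = SEND + PUT, 7 = SEND + PUT + GET.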
With all that was said above, let me provide a few options for
decreasing the amount of registered memory that OMPI uses and also
describe a way to put a strict limit on how much registered memory
OMPI will use.
I'll create some FAQ entries about these exact topics in the Near
Future that will go into more detail, but it might take a few days
because FAQ wording is tricky; the algorithms that OMPI uses and the
tunable parameters that it exports are quite complicated -- I'll want
to be sure it's precisely correct for those who land there via Google.
Here's the quick version (Galen/Gleb/Pasha: please correct me if I
get these details incorrect -- thanks!):
- All internal-to-OMPI registered buffers -- whether they are used
for sending or receiving -- are cached on freelists. So if OMPI
registers an internal buffer, sends from it, and then is done with
it, the buffer is not de-registered -- it is put back on the free
list for use in the future.
- OMPI makes IB connections to peer MPI processes lazily. That is,
the first time you MPI_SEND or MPI_RECV to a peer, OMPI makes the
connection.
- OMPI creates an initial set of pre-posted buffers when each IB port
is initialized. The amount registered for each IB endpoint (i.e.,
ports and LIDs) in use on the host by the MPI process upon MPI_INIT is:
2 * btl_openib_free_list_inc *
(btl_openib_max_send_size + btl_openib_eager_limit)
=> NOTE: There's some pretty pictures of the exact meanings of the
max send size and eager limit and how they are used in this paper:
http://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/.
The "2" is because there are actually 2 free lists -- one for sending
buffers and one for receiving buffers. Default values for these
three MCA parameters are 32 (free_list_inc), 64k (max_send_size), 12k
(eager_limit), respectively. So each MPI process will preregister
about 4.75MB of memory per endpoint in use on the host. Since these
are all MCA parameters, they are all adjustable at run-time.
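=> To spell out the default-value arithmetic: 2 * 32 * (64k + 12k) =
4,864k, or roughly 4.75MB per endpoint. If you want to experiment,
all of these can be changed on the mpirun command line -- e.g. (the
values here are purely illustrative, not a recommendation):
  mpirun --mca btl_openib_free_list_inc 16 \
         --mca btl_openib_max_send_size 32768 \
         --mca btl_openib_eager_limit 4096 ...
would drop the initial pre-registration to 2 * 16 * (32k + 4k) =
1,152k (a bit over 1.1MB) per endpoint.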
- OMPI then pre-registers and pre-posts receive buffers when each
lazy IB connection is made. The buffers are drawn from the freelists
mentioned above, so the first few connections may not actually
register any *new* memory. The freelists register more memory and
dole it out as necessary when requests are made that cannot be
satisfied by what is already on the freelist.
- The number of pre-posted receive buffers is controlled via the
btl_openib_rd_num and btl_openib_rd_win MCA parameters. OMPI pre-
posts btl_openib_rd_num plus a few more (for control messages) --
resulting in 11 buffers by default per queue pair (OMPI uses 2 QPs,
one high priority for eager fragments and one low priority for send
fragments) per endpoint. So there is
11 * (12k + 64k) = 836k
of pre-posted buffer space for each IB connection endpoint.
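=> To put that in perspective: assuming a fully-connected 64-process
job (which is what an alltoall will give you), each MPI process
eventually has 63 such connections, so the pre-posted receive buffers
alone account for roughly 63 * 836k, or about 51MB, per process -- on
top of the initial freelists described above.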
=> What I'm guessing is happening in your network is that IMB is
hitting some communication intensive portions and network traffic
either backs up, starts getting congested, or otherwise becomes
"slow", meaning that OMPI is queueing up traffic faster than the
network can process it. Hence, OMPI keeps registering more and more
memory because there's no more memory available on the freelist to
recycle.
- The send-side buffering behavior is regulated by the
btl_openib_free_list_max MCA parameter, which defaults to -1 (meaning
that the freelist can grow without bound). You can set a cap on
this, telling OMPI how many entries it is allowed to have on the
freelist, but that cap does not correlate directly with how much
memory will actually be registered at any one time when
btl_openib_flags > 1 (because OMPI will also be registering and
caching user buffers). Also keep in mind that this MCA parameter
governs the size of both the sending and receiving buffer freelists.
That being said, if you use btl_openib_flags=1, you can use
btl_openib_free_list_max as a direct method (because OMPI will *not*
be registering and caching user buffers), but you need to choose a
value that will be acceptable for both the send and receive freelists.
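=> For example (the 128 here is purely illustrative):
  mpirun --mca btl_openib_flags 1 \
         --mca btl_openib_free_list_max 128 ...
caps each freelist at 128 entries. Back-of-the-envelope, with the
default 64k max send size and 12k eager limit, 128 entries of each
buffer size works out to about 128 * (64k + 12k) = 9.5MB of
registered memory -- a rough upper bound to reason about.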
What should happen if OMPI hits the btl_openib_free_list_max limit is
that the upper layer (called the "PML") will internally buffer
messages until more IB registered buffers become available. It's not
entirely accurate, but you can think of it as effectively multiple
levels of queueing going on here: MPI requests, PML buffers, IB
registered buffers, network. Fun stuff! :-)
- A future OMPI feature is an MCA parameter called
mpool_rdma_rcache_size_limit. It defaults to an "unlimited" value,
which means that OMPI will try to register memory forever. But if
you set it to a nonzero positive value (in bytes), OMPI will limit
itself to that much registered memory for each MPI process. This MCA
parameter unfortunately didn't make it into the 1.2 release, but will
be included in some future release. This code is currently on the
OMPI trunk (and nightly snapshots), but not available in the 1.2
branch (and nightly snapshots/releases).
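=> If you do end up trying the trunk, the usage would be something
like the following (the value is just an example -- 256MB, expressed
in bytes):
  mpirun --mca mpool_rdma_rcache_size_limit 268435456 ...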
=====
With all those explanations, here's some recommendations for you:
- Try simply setting the size of the eager limit and max send size to
smaller values, perhaps 4k for the eager limit and 12k for the max
send size. This will decrease the amount of registered memory that
OMPI uses for each connection.
- Try setting btl_openib_free_list_max, perhaps in conjunction with
btl_openib_flags=1, to set either indirectly or exactly how much
registered memory is used per endpoint (see the sample command line
after this list).
- If you want to explore the OMPI trunk (with all the normal
disclaimers about development code), try setting
mpool_rdma_rcache_size_limit to a fixed value.
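Putting the first two suggestions together, a sample command line
might look something like this (all of the specific values are just
starting points to experiment with, not known-good settings for your
blades, and the hostfile / benchmark binary names are placeholders):
  mpirun --mca btl_openib_eager_limit 4096 \
         --mca btl_openib_max_send_size 12288 \
         --mca btl_openib_flags 1 \
         --mca btl_openib_free_list_max 128 \
         -np 64 --hostfile myhosts ./IMB-MPI1
(4096 = 4k eager limit, 12288 = 12k max send size, as suggested above.)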
Keep in mind that the intermixing of all of these values is quite
complicated. It's a very, very thin line to walk to balance resource
constraints and application performance. Tweaking one parameter may
give you good resource limits but hose your overall performance.
Another dimension here is that different applications will likely use
different communication patterns, so different sets of values may be
suitable for different applications. It's a complicated parameter
space problem. :-\
2. On another side note, I am having similar problems on another
customer's cluster, where the benchmark hangs but at a different
place each time.
HW specs
* 12x IBM 3455 machines (2x dual-core each), with InfiniPath/PathScale HCAs
* 1x Voltaire Switch
SW
* master: RHEL 4 AS U3
* compute: RHEL 4 WS U3
* OFED 1.1.1 w. OpenMPI-1.1.2
For InfiniPath HCAs, you should probably be using the psm MTL instead
of the openib BTL.
The short version explanation between the two is that MTL plugins are
designed for networks that export MPI-like interfaces (e.g., portals,
tports, MX, InfiniPath). BTL plugins are more geared towards
networks that export RDMA interfaces. You can force using the psm
MTL with:
mpirun --mca pml cm ...
This tells OMPI to use the "cm" PML plugin (PML is the back end to
MPI point-to-point), which, if you've built the "psm" MTL plugin (psm
is the InfiniPath library glue), will use the InfiniPath native back-
end library which will do nice things. Beyond that, someone else
will have to answer -- I have no experience with the psm MTL...
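(So, concretely, something like
  mpirun --mca pml cm -np 48 --hostfile myhosts ./IMB-MPI1
should do it, assuming the psm MTL was built -- the process count and
hostfile name are placeholders, of course.)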
Hope this helps!
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems