On Jul 1, 2009, at 8:01 AM, Jeff Squyres (jsquyres) wrote:
Random thought: would it be easy for the output of cat /dev/knem to
indicate whether IOAT hardware is present?
Well *that* was replying to the wrong message. :-)
A real reply is below...
> I have problems running large jobs on a PC cluster with OpenMPI
> V1.3. Typically the error appears only for processor counts >= 2048
> (actually cores), sometimes also below.
>
> The nodes (Intel Nehalem, 2 procs, 4 cores each) run (Scientific?)
> Linux.
> $> uname -a
> Linux cl3fr1 2.6.18-128.1.10.el5 #1 SMP Thu May 7 12:48:13 EDT 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
> The code starts normally, reads its input data sets (~4GB), does
> some initialization, and then continues with the actual calculations.
> Some time after that, it fails with the following error message:
>
> [n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
> error creating qp errno says Cannot allocate memory
What kind of communication pattern does the application use? Does it
use all-to-all? Open MPI makes OpenFabrics verbs (i.e., IB in your
case) connections lazily, so you might not see these problems until
OMPI tries to make connections -- well past MPI_INIT -- and then
fails when it runs out of HCA QP resources.
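As a quick sanity check on the HCA's limits -- assuming the standard
libibverbs utilities are installed on the nodes -- something like the
following should show the device's maximum QP count (max_qp) and
related limits:

$ ibv_devinfo -v | grep -i max_qp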
> Memory usage by the application should not be the problem. At this
> proc count, the code uses only ~100MB per proc. Also, the code runs
> at lower proc counts, where it consumes more memory.
>
>
> I also get the apparently secondary error messages:
>
> [n100501:14587] [[40339,0],0]-[[40339,1],4] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
This error is likely caused by the first error -- processes fail and
then TCP connections get reset, causing the readv() errors. Perhaps
we should have a better error message for this...
> The cluster uses InfiniBand connections. I am aware only of the
> following parameter changes (systemwide):
> btl_openib_ib_min_rnr_timer = 25
> btl_openib_ib_timeout = 20
>
> $> ulimit -l
> unlimited
>
>
> I attached:
> 1) $> ompi_info --all > ompi_info.log
> 2) stderr from the PBS: stderr.log
Open MPI v1.3 is actually quite flexible in how it creates and uses
OpenFabrics QPs. By default, it likely creates 4 QPs (of varying
buffer sizes) between each pair of MPI processes that connect to each
other. You can tune this down to be 3, 2, or even 1 QP to reduce the
number of QPs that are being opened (and therefore, presumably, not
exhaust HCA QP resources).
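For example (the receive_queues syntax is described in detail below),
a value consisting of a single shared receive queue should create only
1 QP per connected peer pair instead of 4. The buffer size/count here
are purely illustrative, not a recommendation:

    mpirun --mca btl_openib_receive_queues S,65536,256,128,32 ...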
Alternatively / additionally, you may wish to enable XRC if you have
recent enough Mellanox HCAs. This should also save on QP resources.
You can set both of these things via the btl_openib_receive_queues
MCA parameter. It takes a colon-delimited list of receive queues
(which directly implies how many QPs to create). There are 3 kinds of
entries: per-peer QPs, shared receive queues (SRQs), and XRC receive
queues. Here's a description of each:
Per-peer receive queues require between 2 and 5 parameters:
1. Buffer size in bytes (mandatory)
2. Number of buffers (mandatory)
3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
4. Credit window size (optional; defaults to (low_watermark / 2))
5. Number of buffers reserved for credit messages (optional; defaults
   to ((num_buffers * 2) - 1) / credit_window)
Example: P,128,256,128,16
- 128 byte buffers
- 256 buffers to receive incoming MPI messages
- When the number of available buffers reaches 128, re-post 128 more
buffers to reach a total of 256
- If the number of available credits reaches 16, send an explicit
credit message to the sender
- Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
reserved for explicit credit messages
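For example, to try the per-peer spec above for a single run, you can
pass it on the mpirun command line (the -np value and application name
here are just placeholders):

    mpirun --mca btl_openib_receive_queues P,128,256,128,16 -np 2048 ./my_app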
Shared receive queues can take between 2 and 4 parameters:
1. Buffer size in bytes (mandatory)
2. Number of buffers (mandatory)
3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
4. Maximum number of outstanding sends a sender can have to a single
   peer (optional; defaults to (low_watermark / 4))
Example: S,1024,256,128,32
- 1024 byte buffers
- 256 buffers to receive incoming MPI messages
- When the number of available buffers reaches 128, re-post 128 more
buffers to reach a total of 256
- A sender will not send to a peer unless it has fewer than 32
outstanding sends to that peer.
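Multiple entries are simply colon-delimited. For example, a 2-QP
setup with one small per-peer QP and one large SRQ (sizes again
illustrative) would look like:

    mpirun --mca btl_openib_receive_queues P,128,256,192,128:S,65536,256,128,32 ...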
I believe that XRC receive queues are exactly the same as SRQs but
with "X" instead of "S". If you use XRC, you can *only* specify XRC
receive queues -- you cannot also specify PP or SRQ receive queues.
Mellanox may fix that someday, but this restriction currently holds
for the 1.3 OMPI series.
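So an XRC-only value would look something like the following
(assuming your HCA and OFED stack actually support XRC):

    mpirun --mca btl_openib_receive_queues X,128,256,192,128:X,65536,256,128,32 ...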
The default value of btl_openib_receive_queues is likely to be set by
the $prefix/share/openmpi/mca-btl-openib-device-params.ini file. Look
in that file for your specific HCA device and see the "receive_queues"
value set for it. You can override this value on the mpirun command
line or by editing this file (make sure to edit this file on every
node!). If your device does not have a receive_queues value in that
file, you can look up the default value with ompi_info:
$ ompi_info --param btl openib --parsable | grep receive_queues
mca:btl:openib:param:btl_openib_receive_queues:value: P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
mca:btl:openib:param:btl_openib_receive_queues:data_source: default value
mca:btl:openib:param:btl_openib_receive_queues:status: writable
mca:btl:openib:param:btl_openib_receive_queues:help: Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
mca:btl:openib:param:btl_openib_receive_queues:deprecated: no
You can see that in my setup, receive_queues defaults to 4 QPs: 1 PP
for small messages (256 buffers of size 128) and then 3 SRQs of
increasing buffer size.
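Concretely, to see which device stanzas in that file set
receive_queues (adjust $prefix to your installation):

$ grep -B 10 receive_queues $prefix/share/openmpi/mca-btl-openib-device-params.ini

Any of the "mpirun --mca btl_openib_receive_queues ..." examples above
will override whatever that file (or the built-in default) would have
chosen for that run.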
More specifically, if you know the exact message sizes in your
application, you can tune the receive_queues value to exactly fit your
messages and get a very high degree of registered memory utilization.
In the default case above, Open MPI posts a truckload of short message
buffers and a progressively smaller number of larger buffers. This
allows for lots and lots of short message resources (e.g., high
message injection/reception rates) while still reserving a few
resources for longer messages.
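For example, if you knew that (hypothetically) nearly all of your
application's eager messages were under 8 KB, you might try one small
per-peer QP for very short messages plus a single SRQ sized for the
dominant message size. The exact numbers here are purely illustrative
and worth experimenting with:

    mpirun --mca btl_openib_receive_queues P,128,256,192,128:S,8192,512,256,64 ...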
Keep in mind that Open MPI only uses these receive_queues buffers for
short/medium messages -- longer messages are passed via RDMA and
bypass the buffer sizes specified in receive_queues. See
http://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.2
and
http://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3
Hope this helps!
--
Jeff Squyres
Cisco Systems