On Jul 1, 2009, at 8:01 AM, Jeff Squyres (jsquyres) wrote:

Random thought: would it be easy for the output of cat /dev/knem to
indicate whether IOAT hardware is present?


Well *that* was replying to the wrong message.  :-)

A real reply is below...

> I have problems running large jobs on a PC cluster with OpenMPI V1.3.
> Typically the error appears only for processor count >= 2048 (actually
> cores), sometimes also below.
>
> The nodes (Intel Nehalem, 2 procs, 4 cores each) run (scientific?)
> linux.
> $> uname -a
> Linux cl3fr1 2.6.18-128.1.10.el5 #1 SMP Thu May 7 12:48:13 EDT 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
> The code starts normally, reads its input data sets (~4GB), does some
> initialization and then continues with the actual calculations. Some
> time after that, it fails with the following error message:
>
> [n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
> error creating qp errno says Cannot allocate memory


What kind of communication pattern does the application use? Does it use all-to-all? Open MPI makes OpenFabrics verbs (i.e., IB in your case) connections lazily, so you might not see these problems until OMPI is trying to make connections -- well past MPI_INIT -- and then failing when it runs out of HCA QP resources.
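
As a rough illustration (assuming a fully-connected pattern and the
default of 4 QPs per connected peer pair -- see below; your actual
numbers will differ):

  2047 peers x 4 QPs/peer   ~= 8,200 QPs per MPI process
  8,200 QPs x 8 procs/node  ~= 65,500 QPs on each HCA

That can easily exceed the QP limits of the HCA and its driver.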

> Memory usage by the application should not be the problem. At this
> proc count, the code uses only ~100MB per proc. Also, the code runs
> for lower number of procs where it consumes more mem.
>
>
> I also get the apparently secondary error messages:
>
> [n100501:14587] [[40339,0],0]-[[40339,1],4] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)


This error is likely caused by the first error -- processes fail and then TCP connections get reset, causing the readv() errors. Perhaps we should have a better error message for this...

> The cluster uses InfiniBand connections. I am aware only of the
> following parameter changes (systemwide):
> btl_openib_ib_min_rnr_timer = 25
> btl_openib_ib_timeout = 20
>
> $> ulimit -l
> unlimited
>
>
> I attached:
> 1) $> ompi_info --all > ompi_info.log
> 2) stderr from the PBS: stderr.log


Open MPI v1.3 is actually quite flexible in how it creates and uses OpenFabrics QPs. By default, it likely creates 4 QPs (of varying buffer sizes) between each pair of MPI processes that connect to each other. You can tune this down to 3, 2, or even 1 QP to reduce the number of QPs being opened (and therefore, presumably, avoid exhausting HCA QP resources).

Alternatively / additionally, you may wish to enable XRC if you have recent enough Mellanox HCAs. This should also save on QP resources.

You can set both of these things via the btl_openib_receive_queues MCA parameter. It takes a colon-delimited list of receive queues (which directly implies how many QPs to create). There are 3 kinds of entries: per-peer QPs, shared receive queues, and XRC receive queues. Here's a description of each:

Per-peer receive queues require between 2 and 5 parameters:

  1. Buffer size in bytes (mandatory)
  2. Number of buffers (mandatory)
  3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
  4. Credit window size (optional; defaults to (low_watermark / 2))
  5. Number of buffers reserved for credit messages (optional;
     defaults to (num_buffers*2-1)/credit_window)

  Example: P,128,256,128,16
  - 128 byte buffers
  - 256 buffers to receive incoming MPI messages
  - When the number of available buffers reaches 128, re-post 128 more
    buffers to reach a total of 256
  - If the number of available credits reaches 16, send an explicit
    credit message to the sender
  - The number of buffers reserved for explicit credit messages
    defaults to ((256 * 2) - 1) / 16 = 31
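
  If you give only the two mandatory parameters, the optional ones
  are filled in from the formulas above. For instance (an
  illustrative value, not a recommendation), P,4096,8 would get:
  - Low buffer count watermark: 8 / 2 = 4
  - Credit window size: 4 / 2 = 2
  - Buffers reserved for credit messages: ((8 * 2) - 1) / 2 = 7
    (integer division)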

Shared receive queues can take between 2 and 4 parameters:

  1. Buffer size in bytes (mandatory)
  2. Number of buffers (mandatory)
  3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
  4. Maximum number of outstanding sends a sender can have (optional;
     defaults to (low_watermark / 4))

  Example: S,1024,256,128,32
  - 1024 byte buffers
  - 256 buffers to receive incoming MPI messages
  - When the number of available buffers reaches 128, re-post 128 more
    buffers to reach a total of 256
  - A sender will not send to a peer unless it has fewer than 32
    outstanding sends to that peer
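
  Likewise, a two-parameter entry such as S,65536,256 (similar to one
  of the entries in the default value shown below) would default to a
  low watermark of 256 / 2 = 128 and a maximum of 128 / 4 = 32
  outstanding sends per peer.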

I believe that XRC receive queues are exactly the same as SRQs but with "X" instead of "S". If you use XRC, you can *only* specify XRC receive queues -- you cannot also specify PP or SRQ receive queues. Mellanox may fix that someday, but this restriction currently holds for the 1.3 OMPI series.
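
For example (an untested guess on my part -- first check that your
OFED stack and HCA firmware actually support XRC), an XRC-only
analogue of the default value shown below would look something like:

  mpirun --mca btl_openib_receive_queues \
    X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ...

Note that there is no P or S entry in there, per the restriction
above.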

The default value of btl_openib_receive_queues is likely to be set by the $prefix/share/openmpi/mca-btl-openib-device-params.ini file. Look in that file for your specific HCA device and see the "receive_queues" value set for it. You can override this value on the mpirun command line or by editing this file (make sure to edit this file on every node!). If your device does not have a receive_queues value in that file, you can look up the default value with ompi_info:

$ ompi_info --param btl openib --parsable | grep receive_queues
mca:btl:openib:param:btl_openib_receive_queues:value: P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
mca:btl:openib:param:btl_openib_receive_queues:data_source: default value
mca:btl:openib:param:btl_openib_receive_queues:status: writable
mca:btl:openib:param:btl_openib_receive_queues:help: Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
mca:btl:openib:param:btl_openib_receive_queues:deprecated: no

You can see that in my setup, receive_queues defaults to 4 QPs: 1 PP for small messages (256 buffers of size 128) and then 3 SRQs of increasing buffer size.
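
So, to cut down the number of QPs per peer connection as suggested
above, you could override this on the mpirun command line (or put the
same value in the device-params .ini file on every node) with
something like the following -- illustrative values only; tune the
sizes and counts to your application:

  mpirun --mca btl_openib_receive_queues \
    P,128,256,192,128:S,65536,256,128,32 ...

That creates only 2 QPs per connected peer pair: one per-peer QP for
tiny messages and one SRQ for everything else below the point where
the long-message RDMA protocol takes over (see below).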

More specifically, if you know the exact message sizes used by your application, you can tune the receive_queues value to fit your messages exactly and get a very high degree of registered memory utilization. In the default case above, Open MPI posts a truckload of short message buffers and a progressively smaller number of larger message buffers. This allows for lots and lots of short message resources (e.g., high message injection/reception rates) while still providing some resources for longer messages.
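
For example (hypothetical message sizes, just to show the idea): if
your application mostly exchanges ~8KB messages plus small control
messages, a value like

  P,128,256,192,128:S,8192,512,256,32

would concentrate the registered receive buffers in the 8KB range
instead of spreading them across sizes that your application never
uses.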

Keep in mind that Open MPI uses these receive_queues buffers only for short/medium messages -- longer messages are passed via RDMA and bypass the buffer sizes specified in receive_queues. See http://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.2 and http://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3 .

Hope this helps!

--
Jeff Squyres
Cisco Systems
