On Jul 1, 2009, at 8:01 AM, Jeff Squyres (jsquyres) wrote:

Random thought: would it be easy for the output of cat /dev/knem to
indicate whether IOAT hardware is present?


Well *that* was replying to the wrong message.  :-)

A real reply is below...

> I have problems running large jobs on a PC cluster with OpenMPI V1.3.
> Typically the error appears only for processor count >= 2048 (actually
> cores), sometimes also below.
>
> The nodes (Intel Nehalem, 2 procs, 4 cores each) run (scientific?)
> linux.
> $> uname -a
> Linux cl3fr1 2.6.18-128.1.10.el5 #1 SMP Thu May 7 12:48:13 EDT 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
> The code starts normally, reads its input data sets (~4GB), does some
> initialization and then continues with the actual calculations. Some
> time after that, it fails with the following error message:
>
> [n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one]
> error creating qp errno says Cannot allocate memory


What kind of communication pattern does the application use? Does it use all-to-all? Open MPI makes OpenFabrics verbs (i.e., IB in your case) connections lazily, so you might not see these problems until OMPI is trying to make connections -- well past MPI_INIT -- and then failing when it runs out of HCA QP resources.
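
As a rough illustration (assuming a fully-connected pattern and the
default of 4 QPs per connected peer pair -- see below; your actual
numbers will differ):

  2047 peers x 4 QPs/peer   ~= 8,200 QPs per MPI process
  8,200 QPs x 8 procs/node  ~= 65,500 QPs on each HCA

That can easily exceed the QP limits of the HCA and its driver.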

> Memory usage by the application should not be the problem. At this
> proc count, the code uses only ~100MB per proc. Also, the code runs
> for lower number of procs where it consumes more mem.
>
>
> I also get the apparently secondary error messages:
>
> [n100501:14587] [[40339,0],0]-[[40339,1],4] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)


This error is likely caused by the first error -- processes fail and then TCP connections get reset, causing the readv() errors. Perhaps we should have a better error message for this...

> The cluster uses InfiniBand connections. I am aware only of the
> following parameter changes (systemwide):
> btl_openib_ib_min_rnr_timer = 25
> btl_openib_ib_timeout = 20
>
> $> ulimit -l
> unlimited
>
>
> I attached:
> 1) $> ompi_info --all > ompi_info.log
> 2) stderr from the PBS: stderr.log


Open MPI v1.3 is actually quite flexible in how it creates and uses OpenFabrics QPs. By default, it likely creates 4 QPs (of varying buffer sizes) between each pair of MPI processes that connect to each other. You can tune this down to 3, 2, or even 1 QP to reduce the number of QPs being opened (and therefore, presumably, avoid exhausting HCA QP resources).

Alternatively / additionally, you may wish to enable XRC if you have recent enough Mellanox HCAs. This should also save on QP resources.

You can set both of these things via the btl_openib_receive_queues MCA parameter. It takes a colon-delimited list of receive queues (which directly implies how many QPs to create). There are 3 kinds of entries: per-peer QPs, shared receive queues, and XRC receive queues. Here's a description of each:

Per-peer receive queues require between 2 and 5 parameters:

  1. Buffer size in bytes (mandatory)
  2. Number of buffers (mandatory)
  3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
  4. Credit window size (optional; defaults to (low_watermark / 2))
  5. Number of buffers reserved for credit messages (optional;
     defaults to (num_buffers*2-1)/credit_window)

  Example: P,128,256,128,16
  - 128 byte buffers
  - 256 buffers to receive incoming MPI messages
  - When the number of available buffers reaches 128, re-post 128 more
    buffers to reach a total of 256
  - If the number of available credits reaches 16, send an explicit
    credit message to the sender
  - The number of buffers reserved for explicit credit messages
    defaults to ((256 * 2) - 1) / 16 = 31
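
  If you give only the two mandatory parameters, the optional ones
  are filled in from the formulas above. For instance (an
  illustrative value, not a recommendation), P,4096,8 would get:
  - Low buffer count watermark: 8 / 2 = 4
  - Credit window size: 4 / 2 = 2
  - Buffers reserved for credit messages: ((8 * 2) - 1) / 2 = 7
    (integer division)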

Shared receive queues can take between 2 and 4 parameters:

  1. Buffer size in bytes (mandatory)
  2. Number of buffers (mandatory)
  3. Low buffer count watermark (optional; defaults to (num_buffers / 2))
  4. Maximum number of outstanding sends a sender can have (optional;
     defaults to (low_watermark / 4))

  Example: S,1024,256,128,32
  - 1024 byte buffers
  - 256 buffers to receive incoming MPI messages
  - When the number of available buffers reaches 128, re-post 128 more
    buffers to reach a total of 256
  - A sender will not send to a peer unless it has fewer than 32
    outstanding sends to that peer
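
  Likewise, a two-parameter entry such as S,65536,256 (similar to one
  of the entries in the default value shown below) would default to a
  low watermark of 256 / 2 = 128 and a maximum of 128 / 4 = 32
  outstanding sends per peer.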

I believe that XRC receive queues are exactly the same as SRQs but with "X" instead of "S". If you use XRC, you can *only* specify XRC receive queues -- you cannot also specify PP or SRQ receive queues. Mellanox may fix that someday, but this restriction currently holds for the 1.3 OMPI series.
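
For example (an untested guess on my part -- first check that your
OFED stack and HCA firmware actually support XRC), an XRC-only
analogue of the default value shown below would look something like:

  mpirun --mca btl_openib_receive_queues \
    X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ...

Note that there is no P or S entry in there, per the restriction
above.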

The default value of btl_openib_receive_queues is likely to be set by the $prefix/share/openmpi/mca-btl-openib-device-params.ini file. Look in that file for your specific HCA device and see the "receive_queues" value set for it. You can override this value on the mpirun command line or by editing this file (make sure to edit this file on every node!). If your device does not have a receive_queues value in that file, you can look up the default value with ompi_info:

$ ompi_info --param btl openib --parsable | grep receive_queues
mca:btl:openib:param:btl_openib_receive_queues:value: P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
mca:btl:openib:param:btl_openib_receive_queues:data_source: default value
mca:btl:openib:param:btl_openib_receive_queues:status: writable
mca:btl:openib:param:btl_openib_receive_queues:help: Colon-delimited, comma delimited list of receive queues: P,4096,8,6,4:P,32768,8,6,4
mca:btl:openib:param:btl_openib_receive_queues:deprecated: no

You can see that in my setup, receive_queues defaults to 4 QPs: 1 PP for small messages (256 buffers of size 128) and then 3 SRQs of increasing buffer size.
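
So, to cut down the number of QPs per peer connection as suggested
above, you could override this on the mpirun command line (or put the
same value in the device-params .ini file on every node) with
something like the following -- illustrative values only; tune the
sizes and counts to your application:

  mpirun --mca btl_openib_receive_queues \
    P,128,256,192,128:S,65536,256,128,32 ...

That creates only 2 QPs per connected peer pair: one per-peer QP for
tiny messages and one SRQ for everything else below the point where
the long-message RDMA protocol takes over (see below).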

More specifically, if you know the exact message sizes used by your application, you can tune the receive_queues value to fit your messages exactly and get a very high degree of registered memory utilization. In the default case above, Open MPI posts a truckload of short message buffers and a progressively smaller number of larger message buffers. This allows for lots and lots of short message resources (e.g., high message injection/reception rates) while still providing some resources for longer messages.
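
For example (hypothetical message sizes, just to show the idea): if
your application mostly exchanges ~8KB messages plus small control
messages, a value like

  P,128,256,192,128:S,8192,512,256,32

would concentrate the registered receive buffers in the 8KB range
instead of spreading them across sizes that your application never
uses.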

Keep in mind that Open MPI uses these receive_queues buffers only for short/medium messages -- longer messages are passed via RDMA and bypass the buffer sizes specified in receive_queues. See http://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.2 and http://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3 .

Hope this helps!

--
Jeff Squyres
Cisco Systems
