Thanks for the pointer!

Do you know if these sizes are dependent on the hardware?

Thanks,
Allen

On Tue, 2010-08-03 at 10:29 -0400, Terry Dontje wrote:
> Sorry, I didn't see your prior question; glad you found the
> btl_openib_receive_queues parameter.  There is no FAQ entry for
> this, but I found the following in the openib BTL help file, which spells
> out the parameters for a per-peer receive queue (i.e., a receive queue
> setting with "P" as the first argument).
> 
> Per-peer receive queues require between 2 and 5 parameters:
> 
>  1. Buffer size in bytes (mandatory)
>  2. Number of buffers (mandatory)
>  3. Low buffer count watermark (optional; defaults to
>      (num_buffers / 2))
>  4. Credit window size (optional; defaults to (low_watermark / 2))
>  5. Number of buffers reserved for credit messages (optional;
>      defaults to (num_buffers*2-1)/credit_window)
> 
>  Example: P,128,256,128,16
>   - 128 byte buffers
>   - 256 buffers to receive incoming MPI messages
>   - When the number of available buffers reaches 128, re-post 128 more
>     buffers to reach a total of 256
>   - If the number of available credits reaches 16, send an explicit
>     credit message to the sender
>   - Defaulting to ((256 * 2) - 1) / 16 = 31; this many buffers are
>     reserved for explicit credit messages
> 
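The defaulting rules above can be checked with a short script. This is only a sketch in Python, not Open MPI's own parser: the function name parse_per_peer_queue and the dictionary keys are made up for illustration, and truncating integer division is an assumption inferred from the worked example, where ((256 * 2) - 1) / 16 gives 31.

```python
def parse_per_peer_queue(spec):
    """Expand a per-peer receive queue spec such as "P,128,256,128,16".

    Hypothetical helper: it applies the defaults quoted from the openib
    BTL help text and assumes integer (truncating) division.
    """
    fields = spec.split(",")
    if fields[0] != "P":
        raise ValueError("not a per-peer queue spec: %r" % spec)

    values = [int(f) for f in fields[1:]]
    if not 2 <= len(values) <= 5:
        raise ValueError("per-peer queues take between 2 and 5 parameters")

    buffer_size, num_buffers = values[0], values[1]
    # 3. Low buffer count watermark: defaults to num_buffers / 2
    low_watermark = values[2] if len(values) > 2 else num_buffers // 2
    # 4. Credit window size: defaults to low_watermark / 2
    credit_window = values[3] if len(values) > 3 else low_watermark // 2
    if credit_window <= 0:
        raise ValueError("credit window must be positive")
    # 5. Buffers reserved for credit messages:
    #    defaults to (num_buffers * 2 - 1) / credit_window
    credit_reserved = (
        values[4] if len(values) > 4
        else (num_buffers * 2 - 1) // credit_window
    )

    return {
        "buffer_size": buffer_size,
        "num_buffers": num_buffers,
        "low_watermark": low_watermark,
        "credit_window": credit_window,
        "credit_reserved": credit_reserved,
    }


if __name__ == "__main__":
    for spec in ("P,128,256,128,16", "P,65536,256,192,128"):
        print(spec, parse_per_peer_queue(spec))
```

Under the same assumptions, the setting used earlier in this thread, P,65536,256,192,128, would reserve ((256 * 2) - 1) / 128 = 3 buffers for credit messages.
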
> --td
> Allen Barnett wrote: 
> > Hi: In response to my own question, by studying the file
> > mca-btl-openib-device-params.ini, I discovered that this option in
> > OMPI-1.4.2:
> > 
> > -mca btl_openib_receive_queues P,65536,256,192,128
> > 
> > was sufficient to prevent OMPI from trying to create shared receive
> > queues and allowed my application to run to completion using the IB
> > hardware.
> > 
> > I guess my question now is: What do these numbers mean? Presumably the
> > size (or counts?) of buffers to allocate? Are there limits or a way to
> > tune these values?
> > 
> > Thanks,
> > Allen
> > 
> > On Mon, 2010-08-02 at 12:49 -0400, Allen Barnett wrote:
> >   
> > > Hi Terry:
> > > It is indeed the case that the openib BTL has not been initialized. I
> > > ran with your tcp-disabled MCA option and it aborted in MPI_Init.
> > > 
> > > The OFED stack is what's included in RHEL4. It appears to be made up of
> > > the RPMs:
> > > openib-1.4-1.el4
> > > opensm-3.2.5-1.el4
> > > libibverbs-1.1.2-1.el4
> > > 
> > > How can I determine if srq is supported? Is there an MCA option to
> > > defeat it? (Our in-house cluster has more recent Mellanox IB hardware
> > > and is running this same IB stack and ompi 1.4.2 works OK, so I suspect
> > > srq is supported by the OpenFabrics stack. Perhaps.)
> > > 
> > > Thanks,
> > > Allen
> > > 
> > > On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote:
> > >     
> > > > My guess is from the message below saying "(openib) BTL failed to
> > > > initialize"  that the code is probably running over tcp.  To
> > > > absolutely prove this you can specify to only use the openib, sm and
> > > > self btls to eliminate the tcp btl.  To do that you add the following
> > > > to the mpirun line "-mca btl openib,sm,self".  I believe with that
> > > > specification the code will abort and not run to completion.  
> > > > 
> > > > What version of the OFED stack are you using?  I wonder if srq is
> > > > supported on your system or not?
> > > > 
> > > > --td
> > > > 
> > > > Allen Barnett wrote: 
> > > >       
> > > > > Hi: A customer is attempting to run our OpenMPI 1.4.2-based application
> > > > > on a cluster of machines running RHEL4 with the standard OFED stack. The
> > > > > HCAs are identified as:
> > > > > 
> > > > > 03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
> > > > > 04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
> > > > > 
> > > > > ibv_devinfo says that one port on the HCAs is active but the other is
> > > > > down:
> > > > > 
> > > > > hca_id:       mthca0
> > > > >       fw_ver:                         3.0.2
> > > > >       node_guid:                      0006:6a00:9800:4c78
> > > > >       sys_image_guid:                 0006:6a00:9800:4c78
> > > > >       vendor_id:                      0x066a
> > > > >       vendor_part_id:                 23108
> > > > >       hw_ver:                         0xA1
> > > > >       phys_port_cnt:                  2
> > > > >               port:   1
> > > > >                       state:                  active (4)
> > > > >                       max_mtu:                2048 (4)
> > > > >                       active_mtu:             2048 (4)
> > > > >                       sm_lid:                 1
> > > > >                       port_lid:               26
> > > > >                       port_lmc:               0x00
> > > > > 
> > > > >               port:   2
> > > > >                       state:                  down (1)
> > > > >                       max_mtu:                2048 (4)
> > > > >                       active_mtu:             512 (2)
> > > > >                       sm_lid:                 0
> > > > >                       port_lid:               0
> > > > >                       port_lmc:               0x00
> > > > > 
> > > > > 
> > > > >  When the OMPI application is run, it prints the error message:
> > > > > 
> > > > > --------------------------------------------------------------------
> > > > > The OpenFabrics (openib) BTL failed to initialize while trying to
> > > > > create an internal queue.  This typically indicates a failed
> > > > > OpenFabrics installation, faulty hardware, or that Open MPI is
> > > > > attempting to use a feature that is not supported on your hardware
> > > > > (i.e., is a shared receive queue specified in the
> > > > > btl_openib_receive_queues MCA parameter with a device that does not
> > > > > support it?).  The failure occured here:
> > > > > 
> > > > >   Local host:  machine001.lan
> > > > >   OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
> > > > >   Function:    ibv_create_srq()
> > > > >   Error:       Invalid argument (errno=22)
> > > > >   Device:      mthca0
> > > > > 
> > > > > You may need to consult with your system administrator to get this
> > > > > problem fixed.
> > > > > --------------------------------------------------------------------
> > > > > 
> > > > > The full log of a run with "btl_openib_verbose 1" is attached. My
> > > > > application appears to run to completion, but I can't tell if it's just
> > > > > running on TCP and not using the IB hardware.
> > > > > 
> > > > > I would appreciate any suggestions on how to proceed to fix this error.
> > > > > 
> > > > > Thanks,
> > > > > Allen
> 

-- 
Allen Barnett
Transpire, Inc
E-Mail: al...@transpireinc.com
