Hi:

In response to my own question, by studying the file
mca-btl-openib-device-params.ini, I discovered that this option in
OMPI-1.4.2:

    -mca btl_openib_receive_queues P,65536,256,192,128

was sufficient to prevent OMPI from trying to create shared receive
queues and allowed my application to run to completion using the IB
hardware.

I guess my question now is: What do these numbers mean? Presumably the
size (or counts?) of buffers to allocate? Are there limits or a way to
tune these values?

Thanks,
Allen

On Mon, 2010-08-02 at 12:49 -0400, Allen Barnett wrote:
> Hi Terry:
> It is indeed the case that the openib BTL has not been initialized. I
> ran with your tcp-disabled MCA option and it aborted in MPI_Init.
>
> The OFED stack is what's included in RHEL4. It appears to be made up of
> the RPMs:
> openib-1.4-1.el4
> opensm-3.2.5-1.el4
> libibverbs-1.1.2-1.el4
>
> How can I determine if srq is supported? Is there an MCA option to
> defeat it? (Our in-house cluster has more recent Mellanox IB hardware
> and is running this same IB stack and ompi 1.4.2 works OK, so I suspect
> srq is supported by the OpenFabrics stack. Perhaps.)
>
> Thanks,
> Allen
>
> On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote:
> > My guess is from the message below saying "(openib) BTL failed to
> > initialize" that the code is probably running over tcp. To
> > absolutely prove this you can specify to only use the openib, sm and
> > self btls to eliminate the tcp btl. To do that you add the following
> > to the mpirun line "-mca btl openib,sm,self". I believe with that
> > specification the code will abort and not run to completion.
> >
> > What version of the OFED stack are you using? I wonder if srq is
> > supported on your system or not?
> >
> > --td
> >
> > Allen Barnett wrote:
> > > Hi: A customer is attempting to run our OpenMPI 1.4.2-based application
> > > on a cluster of machines running RHEL4 with the standard OFED stack.
> > > The HCAs are identified as:
> > >
> > > 03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
> > > 04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
> > >
> > > ibv_devinfo says that one port on the HCAs is active but the other is
> > > down:
> > >
> > > hca_id: mthca0
> > >     fw_ver: 3.0.2
> > >     node_guid: 0006:6a00:9800:4c78
> > >     sys_image_guid: 0006:6a00:9800:4c78
> > >     vendor_id: 0x066a
> > >     vendor_part_id: 23108
> > >     hw_ver: 0xA1
> > >     phys_port_cnt: 2
> > >     port: 1
> > >         state: active (4)
> > >         max_mtu: 2048 (4)
> > >         active_mtu: 2048 (4)
> > >         sm_lid: 1
> > >         port_lid: 26
> > >         port_lmc: 0x00
> > >
> > >     port: 2
> > >         state: down (1)
> > >         max_mtu: 2048 (4)
> > >         active_mtu: 512 (2)
> > >         sm_lid: 0
> > >         port_lid: 0
> > >         port_lmc: 0x00
> > >
> > >
> > > When the OMPI application is run, it prints the error message:
> > >
> > > --------------------------------------------------------------------
> > > The OpenFabrics (openib) BTL failed to initialize while trying to
> > > create an internal queue. This typically indicates a failed
> > > OpenFabrics installation, faulty hardware, or that Open MPI is
> > > attempting to use a feature that is not supported on your hardware
> > > (i.e., is a shared receive queue specified in the
> > > btl_openib_receive_queues MCA parameter with a device that does not
> > > support it?). The failure occured here:
> > >
> > > Local host: machine001.lan
> > > OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
> > > Function: ibv_create_srq()
> > > Error: Invalid argument (errno=22)
> > > Device: mthca0
> > >
> > > You may need to consult with your system administrator to get this
> > > problem fixed.
> > > --------------------------------------------------------------------
> > >
> > > The full log of a run with "btl_openib_verbose 1" is attached.
> > > My application appears to run to completion, but I can't tell if it's just
> > > running on TCP and not using the IB hardware.
> > >
> > > I would appreciate any suggestions on how to proceed to fix this error.
> > >
> > > Thanks,
> > > Allen
> > >

-- 
Allen Barnett
Transpire, Inc
E-Mail: al...@transpireinc.com
Skype: allenbarnett
Ph: 518-887-2930
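
On the question of what the numbers in P,65536,256,192,128 mean: the sketch below labels each field of a btl_openib_receive_queues value. Nothing in this thread confirms the field meanings. They are an assumption based on my reading of the Open MPI openib BTL documentation (a colon-separated list of queues, each starting with P for per-peer or S for shared), so please check them against "ompi_info --param btl openib" on your installation. The function and field names are hypothetical, chosen only for illustration.

```python
# Assumed field order for each queue type; NOT confirmed by this thread.
# Only the first field (buffer size) is believed to be mandatory.
PER_PEER_FIELDS = ["buffer_size_bytes", "num_buffers", "low_watermark",
                   "credit_window", "reserved_credit_buffers"]
SHARED_FIELDS = ["buffer_size_bytes", "num_buffers", "low_watermark",
                 "max_pending_sends"]

QUEUE_TYPES = {"P": ("per-peer", PER_PEER_FIELDS),
               "S": ("shared", SHARED_FIELDS)}


def parse_receive_queues(spec):
    """Split a receive_queues string into one labeled dict per queue."""
    queues = []
    for entry in spec.split(":"):          # queues are colon-separated
        kind, *values = entry.split(",")   # fields are comma-separated
        if kind not in QUEUE_TYPES:
            # Other queue types are simply rejected in this sketch.
            raise ValueError("unhandled queue type: %r" % kind)
        label, names = QUEUE_TYPES[kind]
        if not values or len(values) > len(names):
            raise ValueError("unexpected field count in %r" % entry)
        queue = {"type": label}
        queue.update(zip(names, (int(v) for v in values)))
        queues.append(queue)
    return queues


def uses_srq(spec):
    """True if any queue in the spec is a shared receive queue."""
    return any(q["type"] == "shared" for q in parse_receive_queues(spec))


if __name__ == "__main__":
    for q in parse_receive_queues("P,65536,256,192,128"):
        print(q)
```

Under these assumptions, the value above would describe a single per-peer queue of 256 buffers of 65536 bytes each, with a low watermark of 192 and a credit window of 128. Because it contains no S entry, uses_srq() returns False for it, which would be consistent with ibv_create_srq() no longer being called.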