Hi: A customer is attempting to run our Open MPI 1.4.2-based application on a cluster of machines running RHEL4 with the standard OFED stack. lspci identifies the HCAs as:
03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)

ibv_devinfo says that one port on the HCAs is active but the other is down:

hca_id: mthca0
        fw_ver:                 3.0.2
        node_guid:              0006:6a00:9800:4c78
        sys_image_guid:         0006:6a00:9800:4c78
        vendor_id:              0x066a
        vendor_part_id:         23108
        hw_ver:                 0xA1
        phys_port_cnt:          2
                port:   1
                        state:          active (4)
                        max_mtu:        2048 (4)
                        active_mtu:     2048 (4)
                        sm_lid:         1
                        port_lid:       26
                        port_lmc:       0x00

                port:   2
                        state:          down (1)
                        max_mtu:        2048 (4)
                        active_mtu:     512 (2)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00

When the OMPI application is run, it prints the error message:

--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
create an internal queue. This typically indicates a failed
OpenFabrics installation, faulty hardware, or that Open MPI is
attempting to use a feature that is not supported on your hardware
(i.e., is a shared receive queue specified in the
btl_openib_receive_queues MCA parameter with a device that does not
support it?).

The failure occured here:

  Local host:   machine001.lan
  OMPI source:  /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
  Function:     ibv_create_srq()
  Error:        Invalid argument (errno=22)
  Device:       mthca0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------

The full log of a run with "btl_openib_verbose 1" is attached. My
application appears to run to completion, but I can't tell whether it is
actually using the IB hardware or just falling back to TCP. I would
appreciate any suggestions on how to proceed in fixing this error.
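One thing I thought of trying, to check whether IB is actually in use:
exclude the TCP BTL so the job can only run over openib (and shared
memory), along these lines:

  mpirun --mca btl openib,sm,self ...

(the "..." standing in for our usual command line). My understanding is
that if the openib BTL fails to initialize and tcp is excluded, the job
should abort instead of silently falling back to TCP, but please correct
me if that isn't a reliable test.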
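Also, in case it helps narrow down the ibv_create_srq() failure, I could
try a small standalone probe against libibverbs, outside of Open MPI,
along these lines. This is only a sketch: it opens the first device it
finds, and the max_wr/max_sge values are placeholder guesses, not
necessarily what Open MPI actually passes at btl_openib.c:250.

/* srq_test.c: try creating a shared receive queue on the first HCA.
 * Build with: gcc srq_test.c -o srq_test -libverbs
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) {
        fprintf(stderr, "no IB devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(list[0]);   /* e.g. mthca0 */
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
        return 1;
    }

    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    if (!pd) {
        fprintf(stderr, "ibv_alloc_pd failed\n");
        return 1;
    }

    /* Same kind of call that fails inside the openib BTL. The sizes
     * below are placeholders, not Open MPI's actual parameters. */
    struct ibv_srq_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.attr.max_wr  = 512;
    attr.attr.max_sge = 1;

    struct ibv_srq *srq = ibv_create_srq(pd, &attr);
    if (!srq) {
        fprintf(stderr, "ibv_create_srq failed: %s (errno=%d)\n",
                strerror(errno), errno);
    } else {
        printf("ibv_create_srq succeeded\n");
        ibv_destroy_srq(srq);
    }

    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(list);
    return srq ? 0 : 1;
}

If that fails with the same EINVAL outside of OMPI, I suppose that would
point at the firmware/driver rather than at Open MPI itself. Does that
sound like a reasonable test?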
Thanks,
Allen

-- 
Allen Barnett
Transpire, Inc
E-Mail: al...@transpireinc.com

openib.listing.gz
Description: GNU Zip compressed data