Hi Terry:
It is indeed the case that the openib BTL has not been initialized. I
ran with your tcp-disabled MCA option and it aborted in MPI_Init.

The OFED stack is the one included in RHEL4. It appears to consist of
these RPMs:
openib-1.4-1.el4
opensm-3.2.5-1.el4
libibverbs-1.1.2-1.el4

How can I determine whether srq is supported? Is there an MCA option to
disable it? (Our in-house cluster has more recent Mellanox IB hardware,
runs this same IB stack, and ompi 1.4.2 works OK there, so I suspect
srq is supported by the OpenFabrics stack itself. Perhaps.)
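
For reference, here is what I'm thinking of trying. The grep just
inspects the extended device attributes that ibv_devinfo prints in
verbose mode; the receive-queue values in the mpirun line are
illustrative defaults, not tuned for this hardware, and "./my_app"
stands in for our application:

```shell
# Check whether the HCA reports SRQ capability; "ibv_devinfo -v" prints
# the extended device attributes, including max_srq (0 would suggest
# the device does not support shared receive queues):
ibv_devinfo -v | grep -i srq

# Possible workaround (illustrative values): restrict Open MPI to
# per-peer (P) receive queues so that no SRQ is ever created:
mpirun -mca btl openib,sm,self \
       -mca btl_openib_receive_queues P,65536,256,192,128 ./my_app
```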

Thanks,
Allen

On Mon, 2010-08-02 at 06:47 -0400, Terry Dontje wrote:
> My guess, from the message below saying "(openib) BTL failed to
> initialize", is that the code is probably running over tcp.  To
> prove this absolutely, you can tell Open MPI to use only the openib,
> sm and self btls, eliminating the tcp btl.  To do that, add "-mca btl
> openib,sm,self" to the mpirun line.  I believe that with this
> specification the code will abort rather than run to completion.  
> 
> What version of the OFED stack are you using?  I wonder whether srq
> is supported on your system.
> 
> --td
> 
> Allen Barnett wrote: 
> > Hi: A customer is attempting to run our OpenMPI 1.4.2-based application
> > on a cluster of machines running RHEL4 with the standard OFED stack. The
> > HCAs are identified as:
> > 
> > 03:01.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
> > 04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
> > 
> > ibv_devinfo says that one port on the HCAs is active but the other is
> > down:
> > 
> > hca_id:     mthca0
> >     fw_ver:                         3.0.2
> >     node_guid:                      0006:6a00:9800:4c78
> >     sys_image_guid:                 0006:6a00:9800:4c78
> >     vendor_id:                      0x066a
> >     vendor_part_id:                 23108
> >     hw_ver:                         0xA1
> >     phys_port_cnt:                  2
> >             port:   1
> >                     state:                  active (4)
> >                     max_mtu:                2048 (4)
> >                     active_mtu:             2048 (4)
> >                     sm_lid:                 1
> >                     port_lid:               26
> >                     port_lmc:               0x00
> > 
> >             port:   2
> >                     state:                  down (1)
> >                     max_mtu:                2048 (4)
> >                     active_mtu:             512 (2)
> >                     sm_lid:                 0
> >                     port_lid:               0
> >                     port_lmc:               0x00
> > 
> > 
> >  When the OMPI application is run, it prints the error message:
> > 
> > --------------------------------------------------------------------
> > The OpenFabrics (openib) BTL failed to initialize while trying to
> > create an internal queue.  This typically indicates a failed
> > OpenFabrics installation, faulty hardware, or that Open MPI is
> > attempting to use a feature that is not supported on your hardware
> > (i.e., is a shared receive queue specified in the
> > btl_openib_receive_queues MCA parameter with a device that does not
> > support it?).  The failure occured here:
> > 
> >   Local host:  machine001.lan
> >   OMPI source: /software/openmpi-1.4.2/ompi/mca/btl/openib/btl_openib.c:250
> >   Function:    ibv_create_srq()
> >   Error:       Invalid argument (errno=22)
> >   Device:      mthca0
> > 
> > You may need to consult with your system administrator to get this
> > problem fixed.
> > --------------------------------------------------------------------
> > 
> > The full log of a run with "btl_openib_verbose 1" is attached. My
> > application appears to run to completion, but I can't tell if it's just
> > running on TCP and not using the IB hardware.
> > 
> > I would appreciate any suggestions on how to proceed to fix this error.
> > 
> > Thanks,
> > Allen
> 

-- 
Allen Barnett
Transpire, Inc
E-Mail: al...@transpireinc.com
