Have you tried using the UCX PML?

The UCX PML is Mellanox's preferred Open MPI mechanism (instead of using the 
openib BTL).
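
A minimal sketch of what that could look like, assuming this Open MPI
build was configured with UCX support (running "ompi_info | grep -i ucx"
should list a ucx PML component if so). The helper name ucx_mpirun_cmd, the
process count, and ./my_mpi_app are placeholders; qedr0:1 is simply the
active device and port from the ibv_devinfo output quoted below. Whether UCX
behaves well on this qedr device is untested here.

```shell
#!/bin/sh
# Print (rather than run) an mpirun command line that selects the UCX PML.
# ucx_mpirun_cmd is a hypothetical helper; device, process count, and
# application path are placeholders to be replaced with real values.
ucx_mpirun_cmd() {
    dev="$1"   # UCX device spec in the form <device>:<port>
    np="$2"    # number of MPI processes
    app="$3"   # MPI application to launch
    # --mca pml ucx       : select the UCX PML
    # --mca btl ^openib   : exclude the openib BTL so it is never opened
    # -x UCX_NET_DEVICES  : export the UCX device selection to every rank
    echo "mpirun -np $np --mca pml ucx --mca btl ^openib -x UCX_NET_DEVICES=$dev $app"
}

# Show the command for the active port reported in this thread.
ucx_mpirun_cmd qedr0:1 2 ./my_mpi_app
```

Printing the command first makes it easy to review before launching it on
the cluster.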


> On Nov 13, 2019, at 9:35 AM, Matteo Guglielmi via users 
> <users@lists.open-mpi.org> wrote:
> 
> I rolled everything back to stock CentOS 7.7, installing OFED via:
> 
> yum groupinstall @infiniband
> 
> yum install rdma-core-devel infiniband-diags-devel
> 
> This does not install the ofed_info command, or at least I could
> not find it (do you know where it is?).
> 
> 
> 
> Open MPI is version 3.1.4.
> 
> The firmware version should be 8.37.7.0.
> 
> I will now try to upgrade the firmware, since changing the OS is not an option.
> 
> 
> 
> Other suggestions?
> 
> 
> Thank you!
> 
> 
> ________________________________
> From: Llolsten Kaonga <l...@soft-forge.com>
> Sent: Wednesday, November 13, 2019 3:25:16 PM
> To: 'Open MPI Users'
> Cc: Matteo Guglielmi
> Subject: RE: [OMPI users] qelr_alloc_context: Failed to allocate context for 
> device.
> 
> Hello Matteo,
> 
> What version of openmpi are you running?
> 
> Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. They
> list support for CentOS 7.6.
> 
> Apologies if you have already tried CentOS 7.6.
> 
> We were able to run Open MPI earlier this month with the following setup:
> 
> OS:                      CentOS 7.6
> mpirun --version:        3.1.4
> ofed_info -s:            OFED-4.17-1
> 
> RNIC fw version          8.50.9.0
> 
> Thanks.
> --
> Llolsten
> 
> -----Original Message-----
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Matteo Guglielmi
> via users
> Sent: Wednesday, November 13, 2019 2:12 AM
> To: users@lists.open-mpi.org
> Cc: Matteo Guglielmi <matteo.guglie...@dalco.ch>
> Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for
> device.
> 
> I'm trying to get openmpi over RoCE working with this setup:
> 
> 
> 
> 
> card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov
> 
> 
> OS: CentOS 7.7
> 
> 
> modinfo qede
> 
> filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/qede/qede.ko.xz
> version:        8.37.0.20
> license:        GPL
> description:    QLogic FastLinQ 4xxxx Ethernet Driver
> retpoline:      Y
> rhelversion:    7.7
> srcversion:     A6AFD0788918644F2EFFF31
> alias:          pci:v00001077d00008090sv*sd*bc*sc*i*
> alias:          pci:v00001077d00008070sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001664sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001656sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001654sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001644sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001636sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001666sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001634sv*sd*bc*sc*i*
> depends:        ptp,qed
> intree:         Y
> vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer:         CentOS Linux kernel signing key
> sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
> sig_hashalgo:   sha256
> parm:           debug: Default debug msglevel (uint)
> 
> modinfo qedr
> 
> filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qedr.ko.xz
> license:        Dual BSD/GPL
> author:         QLogic Corporation
> description:    QLogic 40G/100G ROCE Driver
> retpoline:      Y
> rhelversion:    7.7
> srcversion:     B5B65473217AA2B1F2F619B
> depends:        qede,qed,ib_core
> intree:         Y
> vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer:         CentOS Linux kernel signing key
> sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
> sig_hashalgo:   sha256
> 
> ibv_devinfo
> 
> hca_id: qedr0
> transport: InfiniBand (0)
> fw_ver: 8.37.7.0
> node_guid: b62e:99ff:fea7:8439
> sys_image_guid: b62e:99ff:fea7:8439
> vendor_id: 0x1077
> vendor_part_id: 32880
> hw_ver: 0x0
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 1024 (3)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x00
> link_layer: Ethernet
> 
> hca_id: qedr1
> transport: InfiniBand (0)
> fw_ver: 8.37.7.0
> node_guid: b62e:99ff:fea7:843a
> sys_image_guid: b62e:99ff:fea7:843a
> vendor_id: 0x1077
> vendor_part_id: 32880
> hw_ver: 0x0
> phys_port_cnt: 1
> port: 1
> state: PORT_DOWN (1)
> max_mtu: 4096 (5)
> active_mtu: 1024 (3)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x00
> link_layer: Ethernet
> 
> 
> 
> 
> RDMA actually works at the system level, which means that I can run
> RDMA ping-pong tests etc.
> 
> 
> 
> But when I try to run openmpi with these options:
> 
> 
> 
> mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...
> 
> 
> 
> 
> 
> I get the following error messages:
> 
> 
> 
> 
> --------------------------------------------------------------------------
> WARNING: There is at least non-excluded one OpenFabrics device found, but
> there are no active ports detected (or Open MPI was unable to use them).
> This is most certainly not what you wanted.  Check your cables, subnet
> manager configuration, etc.  The openib BTL will be ignored for this job.
> 
>  Local host: node001
> --------------------------------------------------------------------------
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be used on
> a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>  Local host:           node002
>  Local device:         qedr0
>  Local port:           1
>  CPCs attempted:       rdmacm
> --------------------------------------------------------------------------
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> 
> ...
> 
> 
> I've tried several things such as:
> 
> 
> 1) upgrade the 3.10 kernel's qed* drivers to the latest stable version
> 8.42.9
> 
> 2) upgrade the CentOS kernel from 3.10 to 5.3 via elrepo
> 
> 3) install the latest OFED-4.17-1.tgz stack
> 
> 
> but the error messages never go away and always remain the same.
> 
> 
> 
> Any advice is highly appreciated.
> 
> 
> 
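
As a quick cross-check of the "no active ports" warning, the ibv_devinfo
output quoted above can be filtered down to the ports that report
PORT_ACTIVE. This is only a sketch: active_rdma_ports is a hypothetical
helper name, and the sample input is trimmed from the output in this thread.

```shell
#!/bin/sh
# Read ibv_devinfo-style output on stdin and print <device>:<port> for every
# port whose state is PORT_ACTIVE. active_rdma_ports is a hypothetical helper.
active_rdma_ports() {
    awk '
        $1 == "hca_id:" { dev = $2 }    # remember the current device name
        $1 == "port:"   { port = $2 }   # remember the current port number
        $1 == "state:" && $2 == "PORT_ACTIVE" { print dev ":" port }
    '
}

# Sample input trimmed from the ibv_devinfo output quoted in this thread.
active_rdma_ports <<'EOF'
hca_id: qedr0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
link_layer: Ethernet

hca_id: qedr1
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
link_layer: Ethernet
EOF
```

On a live node the equivalent would be "ibv_devinfo | active_rdma_ports".
For the sample above only qedr0:1 is printed, so the verbs layer does see one
active port even though the openib BTL reports none.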


-- 
Jeff Squyres
jsquy...@cisco.com
