Have you tried using the UCX PML? The UCX PML is Mellanox's preferred Open MPI mechanism (instead of using the openib BTL).
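For reference, here is a minimal sketch of what switching from the openib BTL to the UCX PML could look like on the command line. It assumes your Open MPI build was compiled with UCX support, which this thread does not confirm; the application name and process count are placeholders rather than values from your setup, and whether UCX behaves well with the qedr devices is something you would have to test.

```shell
#!/bin/sh
# Sketch only: APP and NP are hypothetical placeholders, not from this thread.
APP=${APP:-./my_mpi_app}
NP=${NP:-2}

# Select the UCX PML and exclude the openib BTL so it is never opened.
MCA_ARGS="--mca pml ucx --mca btl ^openib"

# Check whether this Open MPI build lists any UCX components at all.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i ucx || echo "no UCX component found in this build"
fi

# Show the command line, then run it only if mpirun and the app exist.
echo "mpirun $MCA_ARGS -np $NP $APP"
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    # MCA_ARGS is left unquoted on purpose so it splits into separate arguments.
    mpirun $MCA_ARGS -np "$NP" "$APP"
fi
```

If ompi_info shows no UCX component, Open MPI would need to be rebuilt against a UCX installation before the pml setting can take effect.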

> On Nov 13, 2019, at 9:35 AM, Matteo Guglielmi via users <users@lists.open-mpi.org> wrote:
>
> I rolled everything back to stock CentOS 7.7, installing OFED via:
>
> yum groupinstall @infiniband
> yum install rdma-core-devel infiniband-diags-devel
>
> which does not install the ofed_info command, or at least I could
> not find it (do you know where it is?).
>
> openmpi is version 3.1.4
>
> the fw version should be 8.37.7.0
>
> I will now try to upgrade the firmware since changing OS is not an option.
>
> Other suggestions?
>
> Thank you!
>
> ________________________________
> From: Llolsten Kaonga <l...@soft-forge.com>
> Sent: Wednesday, November 13, 2019 3:25:16 PM
> To: 'Open MPI Users'
> Cc: Matteo Guglielmi
> Subject: RE: [OMPI users] qelr_alloc_context: Failed to allocate context for device.
>
> Hello Matteo,
>
> What version of openmpi are you running?
>
> Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. It
> supports CentOS 7.6.
>
> Apologies if you have already tried CentOS 7.6.
>
> We have been able to run openmpi (earlier this month):
>
> OS: CentOS 7.6
> mpirun --version: 3.1.4
> ofed_info -s: OFED-4.17-1
>
> RNIC fw version 8.50.9.0
>
> Thanks.
> --
> Llolsten
>
> -----Original Message-----
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Matteo Guglielmi via users
> Sent: Wednesday, November 13, 2019 2:12 AM
> To: users@lists.open-mpi.org
> Cc: Matteo Guglielmi <matteo.guglie...@dalco.ch>
> Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for device.
>
> I'm trying to get openmpi over RoCE working with this setup:
>
> card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov
>
> OS: CentOS 7.7
>
> modinfo qede
>
> filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/qede/qede.ko.xz
> version:        8.37.0.20
> license:        GPL
> description:    QLogic FastLinQ 4xxxx Ethernet Driver
> retpoline:      Y
> rhelversion:    7.7
> srcversion:     A6AFD0788918644F2EFFF31
> alias:          pci:v00001077d00008090sv*sd*bc*sc*i*
> alias:          pci:v00001077d00008070sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001664sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001656sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001654sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001644sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001636sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001666sv*sd*bc*sc*i*
> alias:          pci:v00001077d00001634sv*sd*bc*sc*i*
> depends:        ptp,qed
> intree:         Y
> vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer:         CentOS Linux kernel signing key
> sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
> sig_hashalgo:   sha256
> parm:           debug: Default debug msglevel (uint)
>
> modinfo qedr
>
> filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qedr.ko.xz
> license:        Dual BSD/GPL
> author:         QLogic Corporation
> description:    QLogic 40G/100G ROCE Driver
> retpoline:      Y
> rhelversion:    7.7
> srcversion:     B5B65473217AA2B1F2F619B
> depends:        qede,qed,ib_core
> intree:         Y
> vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer:         CentOS Linux kernel signing key
> sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
> sig_hashalgo:   sha256
>
> ibv_devinfo
>
> hca_id: qedr0
>     transport:          InfiniBand (0)
>     fw_ver:             8.37.7.0
>     node_guid:          b62e:99ff:fea7:8439
>     sys_image_guid:     b62e:99ff:fea7:8439
>     vendor_id:          0x1077
>     vendor_part_id:     32880
>     hw_ver:             0x0
>     phys_port_cnt:      1
>         port: 1
>             state:          PORT_ACTIVE (4)
>             max_mtu:        4096 (5)
>             active_mtu:     1024 (3)
>             sm_lid:         0
>             port_lid:       0
>             port_lmc:       0x00
>             link_layer:     Ethernet
>
> hca_id: qedr1
>     transport:          InfiniBand (0)
>     fw_ver:             8.37.7.0
>     node_guid:          b62e:99ff:fea7:843a
>     sys_image_guid:     b62e:99ff:fea7:843a
>     vendor_id:          0x1077
>     vendor_part_id:     32880
>     hw_ver:             0x0
>     phys_port_cnt:      1
>         port: 1
>             state:          PORT_DOWN (1)
>             max_mtu:        4096 (5)
>             active_mtu:     1024 (3)
>             sm_lid:         0
>             port_lid:       0
>             port_lmc:       0x00
>             link_layer:     Ethernet
>
> RDMA actually works at the system level, which means that I can do
> rdma ping-pong tests etc.
>
> But when I try to run openmpi with these options:
>
> mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...
>
> I get the following error messages:
>
> --------------------------------------------------------------------------
> WARNING: There is at least non-excluded one OpenFabrics device found, but
> there are no active ports detected (or Open MPI was unable to use them).
> This is most certainly not what you wanted. Check your cables, subnet
> manager configuration, etc. The openib BTL will be ignored for this job.
>
> Local host: node001
> --------------------------------------------------------------------------
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be used on
> a specific port. As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
> Local host: node002
> Local device: qedr0
> Local port: 1
> CPCs attempted: rdmacm
> --------------------------------------------------------------------------
> qelr_alloc_context: Failed to allocate context for device.
> qelr_alloc_context: Failed to allocate context for device.
>
> ...
>
> I've tried several things such as:
>
> 1) upgrade the 3.10 kernel's qed* drivers to the latest stable version 8.42.9
>
> 2) upgrade the CentOS kernel from 3.10 to 5.3 via elrepo
>
> 3) install the latest OFED-4.17-1.tgz stack
>
> but the error messages never go away and always remain the same.
>
> Any advice is highly appreciated.

--
Jeff Squyres
jsquy...@cisco.com