[OMPI users] openMPI 1.1.4 - connect() failed with errno=111
Since I installed Open MPI I cannot submit any job that uses CPUs from different machines.

### hostfile ###
lcbcpc02.epfl.ch slots=4 max-slots=4
lcbcpc04.epfl.ch slots=4 max-slots=4

### error message ###
[matteo@lcbcpc02 TEST]$ mpirun --hostfile ~matteo/hostfile -np 8 /home/matteo/Software/NWChem/5.0/bin/nwchem ./nwchem.nw
[0,1,5][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,6][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
6: lcbcpc04.epfl.ch len=16
[0,1,4][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
4: lcbcpc04.epfl.ch len=16
[0,1,7][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=111
7: lcbcpc04.epfl.ch len=16
connect() failed with errno=111
5: lcbcpc04.epfl.ch len=16
#

I did disable the firewall on both machines but I still get that error message.

Thanks,
MG.
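For reference, errno=111 on Linux is ECONNREFUSED: the remote host actively refused the TCP connection. A minimal sanity check for the two nodes in the hostfile above, assuming a stock RHEL/CentOS firewall setup of that era (the commands are illustrative, not taken from the original post):

    # on each node: confirm no packet-filter rules are actually loaded
    /sbin/service iptables status
    /sbin/iptables -L -n

    # from lcbcpc02: confirm the other node resolves and is reachable
    # over the expected interface
    ping -c 1 lcbcpc04.epfl.ch
    ssh lcbcpc04.epfl.ch hostname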
Re: [OMPI users] openMPI 1.1.4 - connect() failed with errno=111
This is the ifconfig output from the machine I use to submit the parallel job:

### ifconfig output - master node ###
[root@lcbcpc02 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:15:17:10:53:C8
          inet addr:128.178.54.74  Bcast:128.178.54.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe10:53c8/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11563938 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6670398 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:16562149093 (15.4 GiB)  TX bytes:1312532185 (1.2 GiB)
          Base address:0x2020 Memory:c282-c284

eth1      Link encap:Ethernet  HWaddr 00:15:17:10:53:C9
          inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe10:53c9/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Base address:0x2000 Memory:c280-c282

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:468156 errors:0 dropped:0 overruns:0 frame:0
          TX packets:468156 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:500286061 (477.1 MiB)  TX bytes:500286061 (477.1 MiB)

This is the ifconfig output from the "slave node":

### ifconfig output - slave node ###
[root@lcbcpc04 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:15:17:10:53:74
          inet addr:128.178.54.76  Bcast:128.178.54.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe10:5374/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:320264 errors:0 dropped:0 overruns:0 frame:0
          TX packets:151942 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:139280839 (132.8 MiB)  TX bytes:82889237 (79.0 MiB)
          Base address:0x2020 Memory:c282-c284

eth1      Link encap:Ethernet  HWaddr 00:15:17:10:53:75
          inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::215:17ff:fe10:5375/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Base address:0x2000 Memory:c280-c282

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:2820 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2820 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2178053 (2.0 MiB)  TX bytes:2178053 (2.0 MiB)

Thanks Jeff!!!

Jeff Squyres wrote:
> I'm assuming that these are Linux hosts. If so, errno 111 is
> "connection refused", possibly meaning that there is still some
> firewall active or the wrong interface is being used to establish
> connections between these machines.
>
> Can you send the output of "ifconfig" (might be /sbin/ifconfig on
> your machine?) from both machines?
>
> On Feb 11, 2007, at 3:45 PM, matteo.guglie...@epfl.ch wrote:
>
>> Since I've installed openmpi I cannot submit any job that uses cpus
>> from different machines.
Re: [OMPI users] openMPI 1.1.4 - connect() failed with errno=111
Jeff Squyres wrote:
> On Feb 12, 2007, at 12:54 PM, Matteo Guglielmi wrote:
>
>> This is the ifconfig output from the machine I use to submit the
>> parallel job:
>
> It looks like both of your nodes share an IP address:
>
>> [root@lcbcpc02 ~]# ifconfig
>> eth1      Link encap:Ethernet  HWaddr 00:15:17:10:53:C9
>>           inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
>>
>> [root@lcbcpc04 ~]# ifconfig
>> eth1      Link encap:Ethernet  HWaddr 00:15:17:10:53:75
>>           inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
>
> This will be problematic to more than just OMPI if these two
> interfaces are on the same network. The solution is to ensure that
> all your nodes have unique IP addresses.
>
> If these NICs are on different networks, then it's a valid network
> configuration, but Open MPI (by default) will assume that these are
> routable to each other. You can tell Open MPI not to use eth1 in
> this case -- see these FAQ entries for details:
>
>   http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>   http://www.open-mpi.org/faq/?category=tcp#tcp-selection
>   http://www.open-mpi.org/faq/?category=tcp#tcp-routability

Those "eth1" NICs are not connected at all; all the machines use only the eth0 interface, which has a different IP on each PC.

Anyway, you solved my problem by suggesting those FAQ entries!!!

  --mca btl_tcp_if_exclude lo,eth1

That's the magic option which works for me!

Thanks Jeff!!!

Thanks,
MG.
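The same interface selection can also be made persistent instead of being passed on every mpirun command line. A minimal sketch, assuming the standard Open MPI MCA parameter file locations and the interface names from this thread:

    # $HOME/.openmpi/mca-params.conf  (per user)
    # or <prefix>/etc/openmpi-mca-params.conf  (system-wide)
    btl_tcp_if_exclude = lo,eth1

    # equivalently, list only the interface that should be used:
    # btl_tcp_if_include = eth0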
Re: [OMPI users] openMPI 1.1.4 - connect() failed with errno=111
Jeff Squyres wrote:
> On Feb 12, 2007, at 2:34 PM, Matteo Guglielmi wrote:
>
>> Those "eth1" NICs are not connected at all; all the machines use
>> only the eth0 interface, which has a different IP on each PC.
>
> Gotcha. But, FWIW, OMPI doesn't know that because they have valid IP
> addresses. So it thinks they're on the same subnet (on the same
> host, actually), and therefore thinks that they should be routable.
>
>> Anyway, you solved my problem by suggesting those FAQ entries!!!
>> --mca btl_tcp_if_exclude lo,eth1 -- that's the magic option which
>> works for me!!!
>
> Excellent -- glad to help.
>
> Another solution might be to simply disable those NICs since they're
> not hooked up to anything; then OMPI should work without any options.

Yep, that's even better!

> Good luck!

Thanks again. I was playing around with the firewall so far and couldn't get any solution out of it... and now I know why: the problem wasn't there!!!

Oh my gosh... you helped me a lot!

Cheers,
MG.
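Disabling the unused interface, as suggested above, could look like this on a RHEL/CentOS system of that era; a sketch, assuming eth1 is managed by the standard network scripts:

    # bring the unused interface down now
    /sbin/ifdown eth1

    # keep it down across reboots:
    # in /etc/sysconfig/network-scripts/ifcfg-eth1 set
    ONBOOT=no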
[OMPI users] qelr_alloc_context: Failed to allocate context for device.
I'm trying to get Open MPI over RoCE working with this setup:

card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov
OS: CentOS 7.7

modinfo qede

filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/qede/qede.ko.xz
version:        8.37.0.20
license:        GPL
description:    QLogic FastLinQ 4 Ethernet Driver
retpoline:      Y
rhelversion:    7.7
srcversion:     A6AFD0788918644F2EFFF31
alias:          pci:v1077d8090sv*sd*bc*sc*i*
alias:          pci:v1077d8070sv*sd*bc*sc*i*
alias:          pci:v1077d1664sv*sd*bc*sc*i*
alias:          pci:v1077d1656sv*sd*bc*sc*i*
alias:          pci:v1077d1654sv*sd*bc*sc*i*
alias:          pci:v1077d1644sv*sd*bc*sc*i*
alias:          pci:v1077d1636sv*sd*bc*sc*i*
alias:          pci:v1077d1666sv*sd*bc*sc*i*
alias:          pci:v1077d1634sv*sd*bc*sc*i*
depends:        ptp,qed
intree:         Y
vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256
parm:           debug: Default debug msglevel (uint)

modinfo qedr

filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qedr.ko.xz
license:        Dual BSD/GPL
author:         QLogic Corporation
description:    QLogic 40G/100G ROCE Driver
retpoline:      Y
rhelversion:    7.7
srcversion:     B5B65473217AA2B1F2F619B
depends:        qede,qed,ib_core
intree:         Y
vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256

ibv_devinfo

hca_id: qedr0
        transport:              InfiniBand (0)
        fw_ver:                 8.37.7.0
        node_guid:              b62e:99ff:fea7:8439
        sys_image_guid:         b62e:99ff:fea7:8439
        vendor_id:              0x1077
        vendor_part_id:         32880
        hw_ver:                 0x0
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     1024 (3)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet

hca_id: qedr1
        transport:              InfiniBand (0)
        fw_ver:                 8.37.7.0
        node_guid:              b62e:99ff:fea7:843a
        sys_image_guid:         b62e:99ff:fea7:843a
        vendor_id:              0x1077
        vendor_part_id:         32880
        hw_ver:                 0x0
        phys_port_cnt:          1
                port:   1
                        state:          PORT_DOWN (1)
                        max_mtu:        4096 (5)
                        active_mtu:     1024 (3)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet

RDMA actually works at the system level, which means that I can do rdma ping-pong tests etc.

But when I try to run Open MPI with these options:

mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...

I get the following error messages:

--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.

  Local host: node001
--------------------------------------------------------------------------
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           node002
  Local device:         qedr0
  Local port:           1
  CPCs attempted:       rdmacm
--------------------------------------------------------------------------
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
...
I've tried several things, such as:

1) upgrade the 3.10 kernel's qed* drivers to the latest stable version 8.42.9
2) upgrade the CentOS kernel from 3.10 to 5.3 via elrepo
3) install the latest OFED-4.17-1.tgz stack

but the error messages never go away and always remain the same.

Any advice is highly appreciated.
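The "rdma ping-pong tests" mentioned above could be reproduced, for example, with the rping tool from librdmacm-utils. This is only a sketch: the node names, IP placeholder, and port number are illustrative, not values from the original post:

    # on node001 (server side)
    rping -s -a 0.0.0.0 -p 9999 -v

    # on node002 (client side), pointing at node001's RoCE-facing IP
    rping -c -a <node001-ip> -p 9999 -C 10 -v

    # the active/inactive RoCE ports can also be double-checked with
    ibv_devinfo | grep -E 'hca_id|state'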
Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.
I rolled everything back to stock CentOS 7.7, installing OFED via:

yum groupinstall @infiniband
yum install rdma-core-devel infiniband-diags-devel

which does not install the ofed_info command, or at least I could not find it (do you know where it is?).

openmpi is version 3.1.4

the fw version should be 8.37.7.0

I will now try to upgrade the firmware, since changing the OS is not an option.

Other suggestions?

Thank you!

From: Llolsten Kaonga
Sent: Wednesday, November 13, 2019 3:25:16 PM
To: 'Open MPI Users'
Cc: Matteo Guglielmi
Subject: RE: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

Hello Matteo,

What version of openmpi are you running?

Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. It supports CentOS 7.6.

Apologies if you have already tried CentOS 7.6.

We have been able to run openmpi (earlier this month):

OS: CentOS 7.6
mpirun --version: 3.1.4
ofed_info -s: OFED-4.17-1
RNIC fw version: 8.50.9.0

Thanks.
--
Llolsten
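Since ofed_info ships with the OFED distribution itself rather than with the inbox CentOS packages, one way to see which RDMA user-space stack is actually installed is to query the package database. A sketch (the grep pattern is illustrative; package names differ between the inbox stack and an OFED install):

    # list the RDMA user-space packages currently installed
    rpm -qa | grep -Ei 'rdma-core|libibverbs|librdmacm|qedr' | sort

    # confirm the qedr devices are still visible to libibverbs
    ibv_devices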
Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.
I'm not using Mellanox OFED because the card is a Marvell OCP-type 25Gb/s 2-port LAN card.

Kernel drivers used are: qede + qedr

Besides that, I did a quick test on two nodes installing CentOS 7.6 and:

ofed_info -s
OFED-4.17-1:

and now the error message is different:

--------------------------------------------------------------------------
[[30578,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: node001

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------

From: Jeff Squyres (jsquyres)
Sent: Wednesday, November 13, 2019 7:16:41 PM
To: Open MPI User's List
Cc: Llolsten Kaonga; Matteo Guglielmi
Subject: Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

Have you tried using the UCX PML? The UCX PML is Mellanox's preferred Open MPI mechanism (instead of using the openib BTL).
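Trying the UCX PML that Jeff suggests could look like the following; a sketch, where the application path and process count are placeholders:

    # check which devices and transports UCX itself can see
    ucx_info -d

    # run Open MPI over the UCX PML instead of the openib BTL
    mpirun --mca pml ucx -np 8 /path/to/app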
Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.
I cannot find a firmware for my card:

https://www.gigabyte.com/za/Accessory/CLNOQ42-rev-10#ov

Do you have the same model?

I found this zip file on the web: Linux_FWupg_41xxx_2.10.78.zip, which contains a firmware upgrade tool and a firmware version 8.50.83, but when I run it I get this error message (card is not supported):

./LnxQlgcUpg.sh
Extracting package contents...
QLogic Firmware Upgrade Utility for Linux: v2.10.78
NIC is not supported. Quitting program ...
Program Exit Code: (16) Failed to upgraded MBI

Thank you.
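For checking which firmware level the NIC is actually running before and after any upgrade attempt, the driver itself reports it; a sketch, where the interface name is a placeholder:

    # firmware and driver versions as reported by the qede driver
    ethtool -i <eth-interface>

    # the same firmware version also shows up in ibv_devinfo as fw_ver
    ibv_devinfo -d qedr0 | grep fw_ver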