[OMPI users] openMPI 1.1.4 - connect() failed with errno=111

2007-02-11 Thread matteo . guglielmi
Since I installed Open MPI I cannot submit any job that uses CPUs from
different machines.

### hostfile ###
lcbcpc02.epfl.ch slots=4 max-slots=4
lcbcpc04.epfl.ch slots=4 max-slots=4


### error message ###
[matteo@lcbcpc02 TEST]$ mpirun --hostfile ~matteo/hostfile -np 8
/home/matteo/Software/NWChem/5.0/bin/nwchem ./nwchem.nw
[0,1,5][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,6][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=111
6: lcbcpc04.epfl.ch len=16
[0,1,4][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=111
4: lcbcpc04.epfl.ch len=16
[0,1,7][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=111
7: lcbcpc04.epfl.ch len=16
connect() failed with errno=111
5: lcbcpc04.epfl.ch len=16
#

I did disable the firewall on both machines but I still get that error message.
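
(Side note: a quick way to separate the launch path from the MPI traffic path is to run a trivial non-MPI command through the same hostfile; the TCP BTL, where these connect() calls happen, is only exercised once the ranks start exchanging MPI messages. The command below is illustrative only.)

### quick launch-only check (illustrative) ###
mpirun --hostfile ~matteo/hostfile -np 8 hostname
# if this prints four lines for lcbcpc02 and four for lcbcpc04, the
# ssh/launch side is fine and the connect() failures come from the
# MPI-level (btl_tcp) connections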

Thanks,
MG.


Re: [OMPI users] openMPI 1.1.4 - connect() failed with errno=111

2007-02-12 Thread Matteo Guglielmi
This is the ifconfig output from the machine I use to submit the
parallel job:

### ifconfig output - master node ###

[root@lcbcpc02 ~]# ifconfig
eth0  Link encap:Ethernet  HWaddr 00:15:17:10:53:C8 
  inet addr:128.178.54.74  Bcast:128.178.54.255  Mask:255.255.255.0
  inet6 addr: fe80::215:17ff:fe10:53c8/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:11563938 errors:0 dropped:0 overruns:0 frame:0
  TX packets:6670398 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:16562149093 (15.4 GiB)  TX bytes:1312532185 (1.2 GiB)
  Base address:0x2020 Memory:c282-c284

eth1  Link encap:Ethernet  HWaddr 00:15:17:10:53:C9 
  inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
  inet6 addr: fe80::215:17ff:fe10:53c9/64 Scope:Link
  UP BROADCAST MULTICAST  MTU:1500  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
  Base address:0x2000 Memory:c280-c282

lo        Link encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:468156 errors:0 dropped:0 overruns:0 frame:0
  TX packets:468156 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:500286061 (477.1 MiB)  TX bytes:500286061 (477.1 MiB)




This is the ifconfig output from the "slave node":

### ifconfig output - slave node ###

[root@lcbcpc04 ~]# ifconfig
eth0  Link encap:Ethernet  HWaddr 00:15:17:10:53:74 
  inet addr:128.178.54.76  Bcast:128.178.54.255  Mask:255.255.255.0
  inet6 addr: fe80::215:17ff:fe10:5374/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:320264 errors:0 dropped:0 overruns:0 frame:0
  TX packets:151942 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:139280839 (132.8 MiB)  TX bytes:82889237 (79.0 MiB)
  Base address:0x2020 Memory:c282-c284

eth1  Link encap:Ethernet  HWaddr 00:15:17:10:53:75 
  inet addr:192.168.0.1  Bcast:192.168.0.255  Mask:255.255.255.0
  inet6 addr: fe80::215:17ff:fe10:5375/64 Scope:Link
  UP BROADCAST MULTICAST  MTU:1500  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
  Base address:0x2000 Memory:c280-c282

lo        Link encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:2820 errors:0 dropped:0 overruns:0 frame:0
  TX packets:2820 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:2178053 (2.0 MiB)  TX bytes:2178053 (2.0 MiB)


Thanks Jeff!!!



Jeff Squyres wrote:
> I'm assuming that these are Linux hosts.  If so, errno 111 is  
> "connection refused" possibly meaning that there is still some  
> firewall active or the wrong interface is being used to establish  
> connections between these machines.
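
(For reference, errno 111 on Linux is ECONNREFUSED; one quick way to confirm this on any of the nodes, assuming the kernel headers are installed:)

$ grep -w 111 /usr/include/asm-generic/errno.h
#define ECONNREFUSED    111     /* Connection refused */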
>
> Can you send the output of "ifconfig" (might be /sbin/ifconfig on  
> your machine?) from both machines?
>
>
> On Feb 11, 2007, at 3:45 PM, matteo.guglie...@epfl.ch wrote:
>
>   
>> Since I've installed openmpi I cannot submit any job that uses cpus  
>> from
>> different machines.
>>
>> ### hostfile ###
>> lcbcpc02.epfl.ch slots=4 max-slots=4
>> lcbcpc04.epfl.ch slots=4 max-slots=4
>> 
>>
>> ### error message ###
>> [matteo@lcbcpc02 TEST]$ mpirun --hostfile ~matteo/hostfile -np 8
>> /home/matteo/Software/NWChem/5.0/bin/nwchem ./nwchem.nw
>> [0,1,5][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c: 
>> 572:mca_btl_tcp_endpoint_complete_connect]
>> [0,1,6][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c: 
>> 572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=111
>> 6: lcbcpc04.epfl.ch len=16
>> [0,1,4][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c: 
>> 572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=111
>> 4: lcbcpc04.epfl.ch len=16
>> [0,1,7][../../../../../ompi/mca/btl/tcp/btl_tcp_endpoint.c: 
>> 572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=111
>> 7: lcbcpc04.epfl.ch len=16
>> connect() failed with errno=111
>> 5: lcbcpc04.epfl.ch len=16
>> #
>>
>> I did disable the firewall on both machines but I still get that  
>> error message.
>>
>> Thanks,
>> MG.
>> ___
>> users mailing list
>> us...

Re: [OMPI users] openMPI 1.1.4 - connect() failed with errno=111

2007-02-12 Thread Matteo Guglielmi
Jeff Squyres wrote:
> On Feb 12, 2007, at 12:54 PM, Matteo Guglielmi wrote:
>
>   
>> This is the ifconfig output from the machine I'm used to submit the
>> parallel job:
>> 
>
> It looks like both of your nodes share an IP address:
>
>   
>> [root@lcbcpc02 ~]# ifconfig
>> eth1  Link encap:Ethernet  HWaddr 00:15:17:10:53:C9
>>   inet addr:192.168.0.1  Bcast:192.168.0.255  Mask: 
>> 255.255.255.0
>> [root@lcbcpc04 ~]# ifconfig
>> eth1  Link encap:Ethernet  HWaddr 00:15:17:10:53:75
>>   inet addr:192.168.0.1  Bcast:192.168.0.255  Mask: 
>> 255.255.255.0
>> 
>
> This will be problematic to more than just OMPI if these two  
> interfaces are on the same network.  The solution is to ensure that  
> all your nodes have unique IP addresses.
>
> If these NICs are on different networks, then it's a valid network  
> configuration, but Open MPI (by default) will assume that these are  
> routable to each other.  You can tell Open MPI to not use eth1 in  
> this case -- see these FAQ entries for details:
>
>http://www.open-mpi.org/faq/?category=tcp#tcp-multi-network
>http://www.open-mpi.org/faq/?category=tcp#tcp-selection
>http://www.open-mpi.org/faq/?category=tcp#tcp-routability
>
>   
Those "eth1" NICs are not connected at all... all the machines use only
the eth0 interface, which has a different IP on each PC.

Anyway, you solved my problem by pointing me to those FAQ entries!!!

--mca btl_tcp_if_exclude lo,eth1

that's the magic option that works for me!!!
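
(Putting that together with the command line from the first message, the full invocation would look roughly like this; illustrative only.)

### mpirun with the workaround (illustrative) ###
mpirun --mca btl_tcp_if_exclude lo,eth1 \
       --hostfile ~matteo/hostfile -np 8 \
       /home/matteo/Software/NWChem/5.0/bin/nwchem ./nwchem.nw
# note: when overriding btl_tcp_if_exclude, keep lo in the list,
# since the option replaces the default exclude list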



Thanks Jeff!!!
Thanks

MG.


Re: [OMPI users] openMPI 1.1.4 - connect() failed with errno=111

2007-02-12 Thread Matteo Guglielmi
Jeff Squyres wrote:
> On Feb 12, 2007, at 2:34 PM, Matteo Guglielmi wrote:
>
>   
>> Those nic "eth1" are not connected at all... all the machines use  
>> only the eth0
>> interface which have different IP for each PC.
>> 
>
> Gotcha.  But, FWIW, OMPI doesn't know that because they have valid IP  
> addresses.  So it thinks they're on the same subnet (on the same  
> host, actually), and therefore thinks that they should be routable.
>
>   
>> Anyway you solved my problem suggesting me those FAQ entries!!!
>> --mca btl_tcp_if_exclude lo,eth1 that's the magic option which  
>> works for me!!!
>> 
>
> Excellent -- glad to help.
>
> Another solution might be to simply disable those NICs since they're  
> not hooked up to anything; then OMPI should work without any options.
>   
Yep that's even better!
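(For completeness, on CentOS/RHEL of that era the quoted suggestion would amount to something like the following, run as root on each node; the commands are illustrative and assume eth1 really is unused.)

### disable the unused NIC (illustrative) ###
ifdown eth1                  # or: ifconfig eth1 down
# to keep it down across reboots, set ONBOOT=no in
# /etc/sysconfig/network-scripts/ifcfg-eth1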
> Good luck!
>
>   
Thanks again,

I had been playing around with the firewall the whole time and couldn't get any
solution out of it... and now I know why... because the problem wasn't there!!!

Oh my gosh... you helped me a lot!

Cheers,
MG.


[OMPI users] unsubscribe

2025-03-31 Thread Matteo Guglielmi
unsubscribe


Matteo Guglielmi | DALCO AG | Industriestr. 28 | 8604 Volketswil | Switzerland 
| T: +41 44 908 38 38 | D: +41 44 908 38 37

To unsubscribe from this group and stop receiving emails from it, send an email 
to users+unsubscr...@lists.open-mpi.org.



[OMPI users] qelr_alloc_context: Failed to allocate context for device.

2019-11-12 Thread Matteo Guglielmi via users
I'm trying to get openmpi over RoCE working with this setup:




card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov


OS: CentOS 7.7


modinfo qede

filename:   
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/qede/qede.ko.xz
version:8.37.0.20
license:GPL
description:QLogic FastLinQ 4 Ethernet Driver
retpoline:  Y
rhelversion:7.7
srcversion: A6AFD0788918644F2EFFF31
alias:  pci:v1077d8090sv*sd*bc*sc*i*
alias:  pci:v1077d8070sv*sd*bc*sc*i*
alias:  pci:v1077d1664sv*sd*bc*sc*i*
alias:  pci:v1077d1656sv*sd*bc*sc*i*
alias:  pci:v1077d1654sv*sd*bc*sc*i*
alias:  pci:v1077d1644sv*sd*bc*sc*i*
alias:  pci:v1077d1636sv*sd*bc*sc*i*
alias:  pci:v1077d1666sv*sd*bc*sc*i*
alias:  pci:v1077d1634sv*sd*bc*sc*i*
depends:ptp,qed
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256
parm:   debug: Default debug msglevel (uint)

modinfo qedr

filename:   
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qedr.ko.xz
license:Dual BSD/GPL
author: QLogic Corporation
description:QLogic 40G/100G ROCE Driver
retpoline:  Y
rhelversion:7.7
srcversion: B5B65473217AA2B1F2F619B
depends:qede,qed,ib_core
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256

ibv_devinfo

hca_id: qedr0
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:8439
sys_image_guid: b62e:99ff:fea7:8439
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

hca_id: qedr1
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:843a
sys_image_guid: b62e:99ff:fea7:843a
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet




RDMA actually works at the system level, which means that I can do

rdma ping-pong tests etc.
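
(For the record, the kind of system-level check meant here is something like rping from librdmacm-utils; the address below is a placeholder.)

### RDMA-level ping-pong (illustrative) ###
rping -s -v                       # on the server node
rping -c -a <server-RoCE-IP> -v   # on the client node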



But when I try to run openmpi with these options:



mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...





I get the following error messages:




--
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them).  This is most certainly not what you wanted.  Check your
cables, subnet manager configuration, etc.  The openib BTL will be
ignored for this job.

  Local host: node001
--
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   node002
  Local device: qedr0
  Local port:   1
  CPCs attempted:   rdmacm
--
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.

...


I've tried several things such as:


1) upgrade the 3.10 kernel's qed* drivers to the latest stable version 8.42.9

2) upgrade the CentOS kernel from 3.10 to 5.3 via elrepo

3) install the latest OFED-4.17-1.tgz stack


but the error messages never go away and always remain the same.



Any advice is highly appreciated.



Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

2019-11-13 Thread Matteo Guglielmi via users
I rolled everything back to stock CentOS 7.7, installing OFED via:




yum groupinstall @infiniband

yum install rdma-core-devel infiniband-diags-devel


which does not install the ofed_info command, or at least I could
not find it (do you know where it is?).
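
(One way to check whether any available package would provide it is shown below; ofed_info normally comes from the vendor OFED install scripts rather than from the distro rdma-core packages, so it may simply not exist on a stock CentOS install. The command is illustrative.)

yum provides '*/ofed_info'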



openmpi is version 3.1.4




the fw version should be 8.37.7.0



I will now try to upgrade the firmware since changing the OS is not an option.



Other suggestions?


Thank you!



From: Llolsten Kaonga 
Sent: Wednesday, November 13, 2019 3:25:16 PM
To: 'Open MPI Users'
Cc: Matteo Guglielmi
Subject: RE: [OMPI users] qelr_alloc_context: Failed to allocate context for 
device.

Hello Mateo,

What version of openmpi are you running?

Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. It
supports CentOS 7.6.

Apologies if you have already tried CentOS 7.6.

We have been able to run openmpi (earlier this month):

OS:  CentOS 7.6
mpirun --version:3.1.4
ofed_info -s:OFED-4.17-1

RNIC fw version  8.50.9.0

Thanks.
--
Llolsten

-Original Message-
From: users  On Behalf Of Matteo Guglielmi
via users
Sent: Wednesday, November 13, 2019 2:12 AM
To: users@lists.open-mpi.org
Cc: Matteo Guglielmi 
Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for
device.

I'm trying to get openmpi over RoCE working with this setup:




card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov


OS: CentOS 7.7


modinfo qede

filename:
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/q
ede/qede.ko.xz
version:8.37.0.20
license:GPL
description:QLogic FastLinQ 4 Ethernet Driver
retpoline:  Y
rhelversion:7.7
srcversion: A6AFD0788918644F2EFFF31
alias:  pci:v1077d8090sv*sd*bc*sc*i*
alias:  pci:v1077d8070sv*sd*bc*sc*i*
alias:  pci:v1077d1664sv*sd*bc*sc*i*
alias:  pci:v1077d1656sv*sd*bc*sc*i*
alias:  pci:v1077d1654sv*sd*bc*sc*i*
alias:  pci:v1077d1644sv*sd*bc*sc*i*
alias:  pci:v1077d1636sv*sd*bc*sc*i*
alias:  pci:v1077d1666sv*sd*bc*sc*i*
alias:  pci:v1077d1634sv*sd*bc*sc*i*
depends:ptp,qed
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256
parm:   debug: Default debug msglevel (uint)

modinfo qedr

filename:
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qe
dr.ko.xz
license:Dual BSD/GPL
author: QLogic Corporation
description:QLogic 40G/100G ROCE Driver
retpoline:  Y
rhelversion:7.7
srcversion: B5B65473217AA2B1F2F619B
depends:qede,qed,ib_core
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256

ibv_devinfo

hca_id: qedr0
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:8439
sys_image_guid: b62e:99ff:fea7:8439
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

hca_id: qedr1
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:843a
sys_image_guid: b62e:99ff:fea7:843a
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet




RDMA actually works at system level which means that I cand do

rdma ping-pong tests etc.



But when I try to run openmpi with these options:



mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...





I get the following error messages:




--
WARNING: There is at least non-excluded one OpenFabrics device found, but
there are no active ports detected (or Open MPI was unable to use them).
This is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this job.

  Local host: node001
--
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
--
No OpenFabrics connection schemes reported that they were able to be used on
a spe

Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

2019-11-13 Thread Matteo Guglielmi via users
I'm not using Mellanox OFED because the card

is a Marvell OCP type 25Gb/s 2-port LAN Card.


Kernel drivers used are:


qede + qedr



Besides that,


I did a quick test on two nodes installing

CentOS 7.6 and:


ofed_info -s

OFED-4.17-1:


and now the error message is different:


--
[[30578,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: node001

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--





From: Jeff Squyres (jsquyres) 
Sent: Wednesday, November 13, 2019 7:16:41 PM
To: Open MPI User's List
Cc: Llolsten Kaonga; Matteo Guglielmi
Subject: Re: [OMPI users] qelr_alloc_context: Failed to allocate context for 
device.

Have you tried using the UCX PML?

The UCX PML is Mellanox's preferred Open MPI mechanism (instead of using the 
openib BTL).
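
(For reference, switching to the UCX PML is just an MCA selection; the lines below are illustrative only and assume Open MPI 3.1.4 was built with UCX support. The device name is taken from the ibv_devinfo output above, and ./app is a placeholder.)

### UCX PML instead of the openib BTL (illustrative) ###
mpirun --mca pml ucx -np 8 ./app
# optionally restrict UCX to the active RoCE port:
# mpirun --mca pml ucx -x UCX_NET_DEVICES=qedr0:1 -np 8 ./app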


> On Nov 13, 2019, at 9:35 AM, Matteo Guglielmi via users 
>  wrote:
>
> I rolled everything back to stock centos 7.7 installing OFED via:
>
>
>
>
> yum groupinstall @infiniband
>
> yum install rdma-core-devel infiniband-diags-devel
>
>
> which does not install the ofed_info command, or at least I could
> not find it (do you know where it is?).
>
>
>
> openmpi is version 3.1.4
>
>
>
>
> the fw version should be 8.37.7.0
>
>
>
> will now try to upgrade the firmware since changing OS is not an option.
>
>
>
> Other suggestions?
>
>
> Thank you!
>
>
> ________
> From: Llolsten Kaonga 
> Sent: Wednesday, November 13, 2019 3:25:16 PM
> To: 'Open MPI Users'
> Cc: Matteo Guglielmi
> Subject: RE: [OMPI users] qelr_alloc_context: Failed to allocate context for 
> device.
>
> Hello Mateo,
>
> What version of openmpi are you running?
>
> Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. It
> supports CentsOS 7.6.
>
> Apologies if you have already tried CentOS 7.6.
>
> We have been able to run openmpi (earlier this month):
>
> OS:  CentOS 7.6
> mpirun --version:3.1.4
> ofed_info -s:OFED-4.17-1
>
> RNIC fw version  8.50.9.0
>
> Thanks.
> --
> Llolsten
>
> -Original Message-
> From: users  On Behalf Of Matteo Guglielmi
> via users
> Sent: Wednesday, November 13, 2019 2:12 AM
> To: users@lists.open-mpi.org
> Cc: Matteo Guglielmi 
> Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for
> device.
>
> I'm trying to get openmpi over RoCE working with this setup:
>
>
>
>
> card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov
>
>
> OS: CentOS 7.7
>
>
> modinfo qede
>
> filename:
> /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/q
> ede/qede.ko.xz
> version:8.37.0.20
> license:GPL
> description:QLogic FastLinQ 4 Ethernet Driver
> retpoline:  Y
> rhelversion:7.7
> srcversion: A6AFD0788918644F2EFFF31
> alias:  pci:v1077d8090sv*sd*bc*sc*i*
> alias:  pci:v1077d8070sv*sd*bc*sc*i*
> alias:  pci:v1077d1664sv*sd*bc*sc*i*
> alias:  pci:v1077d1656sv*sd*bc*sc*i*
> alias:  pci:v1077d1654sv*sd*bc*sc*i*
> alias:  pci:v1077d1644sv*sd*bc*sc*i*
> alias:  pci:v1077d1636sv*sd*bc*sc*i*
> alias:  pci:v1077d1666sv*sd*bc*sc*i*
> alias:  pci:v1077d1634sv*sd*bc*sc*i*
> depends:ptp,qed
> intree: Y
> vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer: CentOS Linux kernel signing key
> sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
> sig_hashalgo:   sha256
> parm:   debug: Default debug msglevel (uint)
>
> modinfo qedr
>
> filename:
> /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qe
> dr.ko.xz
> license:Dual BSD/GPL
> author: QLogic Corporation
> description:QLogic 40G/100G ROCE Driver
> retpoline:  Y
> rhelversion:7.7
> srcversion: B5B65473217AA2B1F2F619B
> depends:qede,qed,ib_core
> intree: Y
> vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer: CentOS Linux kernel signing key
> sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:C

Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

2019-11-13 Thread Matteo Guglielmi via users
I cannot find a firmware for my card:


https://www.gigabyte.com/za/Accessory/CLNOQ42-rev-10#ov


Do you have the same model?



I found this zip file on the web:


Linux_FWupg_41xxx_2.10.78.zip

which contains a firmware upgrade tool and a firmware
version 8.50.83, but when I run it I get this error
message (card is not supported):


./LnxQlgcUpg.sh
Extracting package contents...

QLogic Firmware Upgrade Utility for Linux: v2.10.78

NIC is not supported.
Quitting program ...
Program Exit Code: (16)
Failed to upgraded MBI

thank you.


From: Llolsten Kaonga 
Sent: Wednesday, November 13, 2019 3:25:16 PM
To: 'Open MPI Users'
Cc: Matteo Guglielmi
Subject: RE: [OMPI users] qelr_alloc_context: Failed to allocate context for 
device.

Hello Mateo,

What version of openmpi are you running?

Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. It
supports CentsOS 7.6.

Apologies if you have already tried CentOS 7.6.

We have been able to run openmpi (earlier this month):

OS:  CentOS 7.6
mpirun --version:3.1.4
ofed_info -s:OFED-4.17-1

RNIC fw version  8.50.9.0

Thanks.
--
Llolsten

-Original Message-
From: users  On Behalf Of Matteo Guglielmi
via users
Sent: Wednesday, November 13, 2019 2:12 AM
To: users@lists.open-mpi.org
Cc: Matteo Guglielmi 
Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for
device.

I'm trying to get openmpi over RoCE working with this setup:




card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov


OS: CentOS 7.7


modinfo qede

filename:
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/q
ede/qede.ko.xz
version:8.37.0.20
license:GPL
description:QLogic FastLinQ 4 Ethernet Driver
retpoline:  Y
rhelversion:7.7
srcversion: A6AFD0788918644F2EFFF31
alias:  pci:v1077d8090sv*sd*bc*sc*i*
alias:  pci:v1077d8070sv*sd*bc*sc*i*
alias:  pci:v1077d1664sv*sd*bc*sc*i*
alias:  pci:v1077d1656sv*sd*bc*sc*i*
alias:  pci:v1077d1654sv*sd*bc*sc*i*
alias:  pci:v1077d1644sv*sd*bc*sc*i*
alias:  pci:v1077d1636sv*sd*bc*sc*i*
alias:  pci:v1077d1666sv*sd*bc*sc*i*
alias:  pci:v1077d1634sv*sd*bc*sc*i*
depends:ptp,qed
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256
parm:   debug: Default debug msglevel (uint)

modinfo qedr

filename:
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qe
dr.ko.xz
license:Dual BSD/GPL
author: QLogic Corporation
description:QLogic 40G/100G ROCE Driver
retpoline:  Y
rhelversion:7.7
srcversion: B5B65473217AA2B1F2F619B
depends:qede,qed,ib_core
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256

ibv_devinfo

hca_id: qedr0
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:8439
sys_image_guid: b62e:99ff:fea7:8439
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

hca_id: qedr1
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:843a
sys_image_guid: b62e:99ff:fea7:843a
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet




RDMA actually works at system level which means that I cand do

rdma ping-pong tests etc.



But when I try to run openmpi with these options:



mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...





I get the following error messages:




--
WARNING: There is at least non-excluded one OpenFabrics device found, but
there are no active ports detected (or Open MPI was unable to use them).
This is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this job.

  Local host: node001
--
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate con