Re: [OMPI users] Change behavior of --output-filename

2019-11-13 Thread Max Sagebaum via users
Ok, thank you for the GitHub links. I missed those. But the question remains
whether the old functionality in ./orte/orted/orted_main.c is still accessible
via some configuration parameter.

I will also follow up with the questions on GitHub.








Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

2019-11-13 Thread Llolsten Kaonga via users
Hello Matteo,

What version of openmpi are you running?

Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. It
supports CentOS 7.6.

Apologies if you have already tried CentOS 7.6.

We have been able to run openmpi (earlier this month):

OS: CentOS 7.6
mpirun --version:   3.1.4
ofed_info -s:   OFED-4.17-1

RNIC fw version  8.50.9.0

Thanks.
--
Llolsten

-Original Message-
From: users  On Behalf Of Matteo Guglielmi
via users
Sent: Wednesday, November 13, 2019 2:12 AM
To: users@lists.open-mpi.org
Cc: Matteo Guglielmi 
Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for
device.

I'm trying to get openmpi over RoCE working with this setup:




card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov


OS: CentOS 7.7


modinfo qede

filename:
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/q
ede/qede.ko.xz
version:8.37.0.20
license:GPL
description:QLogic FastLinQ 4 Ethernet Driver
retpoline:  Y
rhelversion:7.7
srcversion: A6AFD0788918644F2EFFF31
alias:  pci:v1077d8090sv*sd*bc*sc*i*
alias:  pci:v1077d8070sv*sd*bc*sc*i*
alias:  pci:v1077d1664sv*sd*bc*sc*i*
alias:  pci:v1077d1656sv*sd*bc*sc*i*
alias:  pci:v1077d1654sv*sd*bc*sc*i*
alias:  pci:v1077d1644sv*sd*bc*sc*i*
alias:  pci:v1077d1636sv*sd*bc*sc*i*
alias:  pci:v1077d1666sv*sd*bc*sc*i*
alias:  pci:v1077d1634sv*sd*bc*sc*i*
depends:ptp,qed
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256
parm:   debug: Default debug msglevel (uint)

modinfo qedr

filename:
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qe
dr.ko.xz
license:Dual BSD/GPL
author: QLogic Corporation
description:QLogic 40G/100G ROCE Driver
retpoline:  Y
rhelversion:7.7
srcversion: B5B65473217AA2B1F2F619B
depends:qede,qed,ib_core
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256

ibv_devinfo

hca_id: qedr0
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:8439
sys_image_guid: b62e:99ff:fea7:8439
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

hca_id: qedr1
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:843a
sys_image_guid: b62e:99ff:fea7:843a
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet




RDMA actually works at the system level, which means that I can do

rdma ping-pong tests etc.
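
(For reference, a minimal sketch of such a system-level ping-pong check,
assuming the rping utility from rdma-core/librdmacm-utils is installed and
<server-ip> is the address of the RoCE interface on the other node:

# on the first node (server side)
rping -s -v

# on the second node (client side), run a short ping-pong of 10 iterations
rping -c -a <server-ip> -C 10 -v
)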



But when I try to run openmpi with these options:



mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...





I get the following error messages:




--
WARNING: There is at least non-excluded one OpenFabrics device found, but
there are no active ports detected (or Open MPI was unable to use them).
This is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this job.

  Local host: node001
--
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
--
No OpenFabrics connection schemes reported that they were able to be used on
a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   node002
  Local device: qedr0
  Local port:   1
  CPCs attempted:   rdmacm
--
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.

...


I've tried several things such as:


1) upgrade the 3.10 kernel's qed* drivers to the latest stable version
8.42.9

2) upgrade the CentOS kernel from 3.10 to 5.3 via elrepo

3) install the latest OFED-4.17-1.tgz stack


but the error messages never go away and always remain the same.

Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

2019-11-13 Thread Matteo Guglielmi via users
I rolled everything back to stock centos 7.7 installing OFED via:




yum groupinstall @infiniband

yum install rdma-core-devel infiniband-diags-devel


which does not install the ofed_info command, or at least I could
not find it (do you know where it is?).



openmpi is version 3.1.4




the fw version should be 8.37.7.0



will now try to upgrade the firmware since changing OS is not an option.



Other suggestions?


Thank you!



From: Llolsten Kaonga 
Sent: Wednesday, November 13, 2019 3:25:16 PM
To: 'Open MPI Users'
Cc: Matteo Guglielmi
Subject: RE: [OMPI users] qelr_alloc_context: Failed to allocate context for 
device.

Hello Matteo,

What version of openmpi are you running?

Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. It
supports CentOS 7.6.

Apologies if you have already tried CentOS 7.6.

We have been able to run openmpi (earlier this month):

OS:  CentOS 7.6
mpirun --version:3.1.4
ofed_info -s:OFED-4.17-1

RNIC fw version  8.50.9.0

Thanks.
--
Llolsten

-Original Message-
From: users  On Behalf Of Matteo Guglielmi
via users
Sent: Wednesday, November 13, 2019 2:12 AM
To: users@lists.open-mpi.org
Cc: Matteo Guglielmi 
Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for
device.

I'm trying to get openmpi over RoCE working with this setup:




card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov


OS: CentOS 7.7


modinfo qede

filename:
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/q
ede/qede.ko.xz
version:8.37.0.20
license:GPL
description:QLogic FastLinQ 4 Ethernet Driver
retpoline:  Y
rhelversion:7.7
srcversion: A6AFD0788918644F2EFFF31
alias:  pci:v1077d8090sv*sd*bc*sc*i*
alias:  pci:v1077d8070sv*sd*bc*sc*i*
alias:  pci:v1077d1664sv*sd*bc*sc*i*
alias:  pci:v1077d1656sv*sd*bc*sc*i*
alias:  pci:v1077d1654sv*sd*bc*sc*i*
alias:  pci:v1077d1644sv*sd*bc*sc*i*
alias:  pci:v1077d1636sv*sd*bc*sc*i*
alias:  pci:v1077d1666sv*sd*bc*sc*i*
alias:  pci:v1077d1634sv*sd*bc*sc*i*
depends:ptp,qed
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256
parm:   debug: Default debug msglevel (uint)

modinfo qedr

filename:
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qe
dr.ko.xz
license:Dual BSD/GPL
author: QLogic Corporation
description:QLogic 40G/100G ROCE Driver
retpoline:  Y
rhelversion:7.7
srcversion: B5B65473217AA2B1F2F619B
depends:qede,qed,ib_core
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256

ibv_devinfo

hca_id: qedr0
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:8439
sys_image_guid: b62e:99ff:fea7:8439
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

hca_id: qedr1
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:843a
sys_image_guid: b62e:99ff:fea7:843a
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet




RDMA actually works at the system level, which means that I can do

rdma ping-pong tests etc.



But when I try to run openmpi with these options:



mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...





I get the following error messages:




--
WARNING: There is at least non-excluded one OpenFabrics device found, but
there are no active ports detected (or Open MPI was unable to use them).
This is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this job.

  Local host: node001
--
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
--
No OpenFabrics connection schemes reported that they were able to be used on
a specific port.  As such

Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

2019-11-13 Thread Jeff Squyres (jsquyres) via users
Have you tried using the UCX PML?

The UCX PML is Mellanox's preferred Open MPI mechanism (instead of using the 
openib BTL).
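
(As a rough sketch, and assuming Open MPI was built with UCX support, that UCX
itself recognizes the qedr devices, and with ./my_app standing in for the
actual application, the invocation would look something like:

mpirun --mca pml ucx -np 4 ./my_app

Running ompi_info | grep ucx can be used to check whether a UCX PML component
is present in the installed build.)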


> On Nov 13, 2019, at 9:35 AM, Matteo Guglielmi via users 
>  wrote:
> 
> I rolled everything back to stock centos 7.7 installing OFED via:
> 
> 
> 
> 
> yum groupinstall @infiniband
> 
> yum install rdma-core-devel infiniband-diags-devel
> 
> 
> which does not install the ofed_info command, or at least I could
> not find it (do you know where it is?).
> 
> 
> 
> openmpi is version 3.1.4
> 
> 
> 
> 
> the fw version should be 8.37.7.0
> 
> 
> 
> will now try to upgrade the firmware since changing OS is not an option.
> 
> 
> 
> Other suggestions?
> 
> 
> Thank you!
> 
> 
> 
> From: Llolsten Kaonga 
> Sent: Wednesday, November 13, 2019 3:25:16 PM
> To: 'Open MPI Users'
> Cc: Matteo Guglielmi
> Subject: RE: [OMPI users] qelr_alloc_context: Failed to allocate context for 
> device.
> 
> Hello Matteo,
> 
> What version of openmpi are you running?
> 
> Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. It
> supports CentOS 7.6.
> 
> Apologies if you have already tried CentOS 7.6.
> 
> We have been able to run openmpi (earlier this month):
> 
> OS:  CentOS 7.6
> mpirun --version:3.1.4
> ofed_info -s:OFED-4.17-1
> 
> RNIC fw version  8.50.9.0
> 
> Thanks.
> --
> Llolsten
> 
> -Original Message-
> From: users  On Behalf Of Matteo Guglielmi
> via users
> Sent: Wednesday, November 13, 2019 2:12 AM
> To: users@lists.open-mpi.org
> Cc: Matteo Guglielmi 
> Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for
> device.
> 
> I'm trying to get openmpi over RoCE working with this setup:
> 
> 
> 
> 
> card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov
> 
> 
> OS: CentOS 7.7
> 
> 
> modinfo qede
> 
> filename:
> /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/q
> ede/qede.ko.xz
> version:8.37.0.20
> license:GPL
> description:QLogic FastLinQ 4 Ethernet Driver
> retpoline:  Y
> rhelversion:7.7
> srcversion: A6AFD0788918644F2EFFF31
> alias:  pci:v1077d8090sv*sd*bc*sc*i*
> alias:  pci:v1077d8070sv*sd*bc*sc*i*
> alias:  pci:v1077d1664sv*sd*bc*sc*i*
> alias:  pci:v1077d1656sv*sd*bc*sc*i*
> alias:  pci:v1077d1654sv*sd*bc*sc*i*
> alias:  pci:v1077d1644sv*sd*bc*sc*i*
> alias:  pci:v1077d1636sv*sd*bc*sc*i*
> alias:  pci:v1077d1666sv*sd*bc*sc*i*
> alias:  pci:v1077d1634sv*sd*bc*sc*i*
> depends:ptp,qed
> intree: Y
> vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer: CentOS Linux kernel signing key
> sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
> sig_hashalgo:   sha256
> parm:   debug: Default debug msglevel (uint)
> 
> modinfo qedr
> 
> filename:
> /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qe
> dr.ko.xz
> license:Dual BSD/GPL
> author: QLogic Corporation
> description:QLogic 40G/100G ROCE Driver
> retpoline:  Y
> rhelversion:7.7
> srcversion: B5B65473217AA2B1F2F619B
> depends:qede,qed,ib_core
> intree: Y
> vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer: CentOS Linux kernel signing key
> sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
> sig_hashalgo:   sha256
> 
> ibv_devinfo
> 
> hca_id: qedr0
> transport: InfiniBand (0)
> fw_ver: 8.37.7.0
> node_guid: b62e:99ff:fea7:8439
> sys_image_guid: b62e:99ff:fea7:8439
> vendor_id: 0x1077
> vendor_part_id: 32880
> hw_ver: 0x0
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 1024 (3)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x00
> link_layer: Ethernet
> 
> hca_id: qedr1
> transport: InfiniBand (0)
> fw_ver: 8.37.7.0
> node_guid: b62e:99ff:fea7:843a
> sys_image_guid: b62e:99ff:fea7:843a
> vendor_id: 0x1077
> vendor_part_id: 32880
> hw_ver: 0x0
> phys_port_cnt: 1
> port: 1
> state: PORT_DOWN (1)
> max_mtu: 4096 (5)
> active_mtu: 1024 (3)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x00
> link_layer: Ethernet
> 
> 
> 
> 
> RDMA actually works at the system level, which means that I can do
> 
> rdma ping-pong tests etc.
> 
> 
> 
> But when I try to run openmpi with these options:
> 
> 
> 
> mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...
> 
> 
> 
> 
> 
> I get the following error messages:
> 
> 
> 
> 
> --
> WARNING: There is at least non-excluded one OpenFabrics device found, but
> there are no active ports detected (or Open MPI was unable to use them).
> This is most certainly not what you wanted.  Check your cables, subnet
> manager configuration, etc.  The openib BTL will be 

Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

2019-11-13 Thread Matteo Guglielmi via users
I'm not using Mellanox OFED because the card

is a Marvell OCP type 25Gb/s 2-port LAN Card.


Kernel drivers used are:


qede + qedr



Beside that,


I did a quick test on two nodes installing

CentOS 7.6 and:


ofed_info -s

OFED-4.17-1:


and now the error message is different:


--
[[30578,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: node001

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--





From: Jeff Squyres (jsquyres) 
Sent: Wednesday, November 13, 2019 7:16:41 PM
To: Open MPI User's List
Cc: Llolsten Kaonga; Matteo Guglielmi
Subject: Re: [OMPI users] qelr_alloc_context: Failed to allocate context for 
device.

Have you tried using the UCX PML?

The UCX PML is Mellanox's preferred Open MPI mechanism (instead of using the 
openib BTL).


> On Nov 13, 2019, at 9:35 AM, Matteo Guglielmi via users 
>  wrote:
>
> I rolled everything back to stock centos 7.7 installing OFED via:
>
>
>
>
> yum groupinstall @infiniband
>
> yum install rdma-core-devel infiniband-diags-devel
>
>
> which does not install the ofed_info command, or at least I could
> not find it (do you know where it is?).
>
>
>
> openmpi is version 3.1.4
>
>
>
>
> the fw version should be 8.37.7.0
>
>
>
> will now try to upgrade the firmware since changing OS is not an option.
>
>
>
> Other suggestions?
>
>
> Thank you!
>
>
> 
> From: Llolsten Kaonga 
> Sent: Wednesday, November 13, 2019 3:25:16 PM
> To: 'Open MPI Users'
> Cc: Matteo Guglielmi
> Subject: RE: [OMPI users] qelr_alloc_context: Failed to allocate context for 
> device.
>
> Hello Matteo,
>
> What version of openmpi are you running?
>
> Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. It
> supports CentOS 7.6.
>
> Apologies if you have already tried CentOS 7.6.
>
> We have been able to run openmpi (earlier this month):
>
> OS:  CentOS 7.6
> mpirun --version:3.1.4
> ofed_info -s:OFED-4.17-1
>
> RNIC fw version  8.50.9.0
>
> Thanks.
> --
> Llolsten
>
> -Original Message-
> From: users  On Behalf Of Matteo Guglielmi
> via users
> Sent: Wednesday, November 13, 2019 2:12 AM
> To: users@lists.open-mpi.org
> Cc: Matteo Guglielmi 
> Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for
> device.
>
> I'm trying to get openmpi over RoCE working with this setup:
>
>
>
>
> card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov
>
>
> OS: CentOS 7.7
>
>
> modinfo qede
>
> filename:
> /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/q
> ede/qede.ko.xz
> version:8.37.0.20
> license:GPL
> description:QLogic FastLinQ 4 Ethernet Driver
> retpoline:  Y
> rhelversion:7.7
> srcversion: A6AFD0788918644F2EFFF31
> alias:  pci:v1077d8090sv*sd*bc*sc*i*
> alias:  pci:v1077d8070sv*sd*bc*sc*i*
> alias:  pci:v1077d1664sv*sd*bc*sc*i*
> alias:  pci:v1077d1656sv*sd*bc*sc*i*
> alias:  pci:v1077d1654sv*sd*bc*sc*i*
> alias:  pci:v1077d1644sv*sd*bc*sc*i*
> alias:  pci:v1077d1636sv*sd*bc*sc*i*
> alias:  pci:v1077d1666sv*sd*bc*sc*i*
> alias:  pci:v1077d1634sv*sd*bc*sc*i*
> depends:ptp,qed
> intree: Y
> vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer: CentOS Linux kernel signing key
> sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
> sig_hashalgo:   sha256
> parm:   debug: Default debug msglevel (uint)
>
> modinfo qedr
>
> filename:
> /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qe
> dr.ko.xz
> license:Dual BSD/GPL
> author: QLogic Corporation
> description:QLogic 40G/100G ROCE Driver
> retpoline:  Y
> rhelversion:7.7
> srcversion: B5B65473217AA2B1F2F619B
> depends:qede,qed,ib_core
> intree: Y
> vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
> signer: CentOS Linux kernel signing key
> sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
> sig_hashalgo:   sha256
>
> ibv_devinfo
>
> hca_id: qedr0
> transport: InfiniBand (0)
> fw_ver: 8.37.7.0
> node_guid: b62e:99ff:fea7:8439
> sys_image_guid: b62e:99ff:fea7:8439
> vendor_id: 0x1077
> vendor_part_id: 32880
> hw_ver: 0x0
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: 4096 (5)
> active_mtu: 1024 (3)
> sm_lid: 0
> port_lid: 0
> port_lmc: 0x00
> link_layer: Ethernet
>
> hca_id: qedr1

Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

2019-11-13 Thread Matteo Guglielmi via users
I cannot find a firmware for my card:


https://www.gigabyte.com/za/Accessory/CLNOQ42-rev-10#ov


Do you have the same model?



I found this zip file on the web:


Linux_FWupg_41xxx_2.10.78.zip

which contains a firmware upgrade tool and a firmware
version 8.50.83, but when I run it I get this error
message (card is not supported):


./LnxQlgcUpg.sh
Extracting package contents...

QLogic Firmware Upgrade Utility for Linux: v2.10.78

NIC is not supported.
Quitting program ...
Program Exit Code: (16)
Failed to upgraded MBI

thank you.


From: Llolsten Kaonga 
Sent: Wednesday, November 13, 2019 3:25:16 PM
To: 'Open MPI Users'
Cc: Matteo Guglielmi
Subject: RE: [OMPI users] qelr_alloc_context: Failed to allocate context for 
device.

Hello Matteo,

What version of openmpi are you running?

Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7. It
supports CentOS 7.6.

Apologies if you have already tried CentOS 7.6.

We have been able to run openmpi (earlier this month):

OS:  CentOS 7.6
mpirun --version:3.1.4
ofed_info -s:OFED-4.17-1

RNIC fw version  8.50.9.0

Thanks.
--
Llolsten

-Original Message-
From: users  On Behalf Of Matteo Guglielmi
via users
Sent: Wednesday, November 13, 2019 2:12 AM
To: users@lists.open-mpi.org
Cc: Matteo Guglielmi 
Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for
device.

I'm trying to get openmpi over RoCE working with this setup:




card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov


OS: CentOS 7.7


modinfo qede

filename:
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/q
ede/qede.ko.xz
version:8.37.0.20
license:GPL
description:QLogic FastLinQ 4 Ethernet Driver
retpoline:  Y
rhelversion:7.7
srcversion: A6AFD0788918644F2EFFF31
alias:  pci:v1077d8090sv*sd*bc*sc*i*
alias:  pci:v1077d8070sv*sd*bc*sc*i*
alias:  pci:v1077d1664sv*sd*bc*sc*i*
alias:  pci:v1077d1656sv*sd*bc*sc*i*
alias:  pci:v1077d1654sv*sd*bc*sc*i*
alias:  pci:v1077d1644sv*sd*bc*sc*i*
alias:  pci:v1077d1636sv*sd*bc*sc*i*
alias:  pci:v1077d1666sv*sd*bc*sc*i*
alias:  pci:v1077d1634sv*sd*bc*sc*i*
depends:ptp,qed
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256
parm:   debug: Default debug msglevel (uint)

modinfo qedr

filename:
/lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qe
dr.ko.xz
license:Dual BSD/GPL
author: QLogic Corporation
description:QLogic 40G/100G ROCE Driver
retpoline:  Y
rhelversion:7.7
srcversion: B5B65473217AA2B1F2F619B
depends:qede,qed,ib_core
intree: Y
vermagic:   3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256

ibv_devinfo

hca_id: qedr0
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:8439
sys_image_guid: b62e:99ff:fea7:8439
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

hca_id: qedr1
transport: InfiniBand (0)
fw_ver: 8.37.7.0
node_guid: b62e:99ff:fea7:843a
sys_image_guid: b62e:99ff:fea7:843a
vendor_id: 0x1077
vendor_part_id: 32880
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet




RDMA actually works at the system level, which means that I can do

rdma ping-pong tests etc.



But when I try to run openmpi with these options:



mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...





I get the following error messages:




--
WARNING: There is at least non-excluded one OpenFabrics device found, but
there are no active ports detected (or Open MPI was unable to use them).
This is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this job.

  Local host: node001
--
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
--

[OMPI users] MPI_Iallreduce with multidimensional Fortran array

2019-11-13 Thread Camille Coti via users

Dear all,

I have a little piece of code shown below that initializes a 
multidimensional Fortran array and performs:

- a non-blocking MPI_Iallreduce immediately followed by an MPI_Wait
- a blocking MPI_Allreduce
After both calls, it displays a few elements of the input and output 
buffers.


In the output I am showing below, the first column gives the indices of 
the element displayed, the second column gives the corresponding element 
in the input array, the third column gives the corresponding element in 
the output array. All the processes have the same input array so the 
output should just be a multiple of the input.


I tried to compile and execute it with OpenMPI 4.0.1 on a single node, I 
get:


coti@xxx:~$ mpiexec -n 4 test_allreduce
 Rank   3  /    4
 Rank   1  /    4
 Rank   0  /    4
 Rank   2  /    4
 Non-blocking
 1,1,1,1   5  1252991616
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6   21197
 
 Blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 

I just cloned the master branch of the Git repository and compiled it 
(hash db52da40c379610360676f225cd7c767e5a964d3), with the following 
configuration line:

  $ ./configure --prefix=<> --enable-mpi-fortran=usempi

I get:

coti@yyy:~$ mpiexec --mca btl vader,self -n 4 ./test_allreduce
 Rank   0  /    4
 Rank   1  /    4
 Rank   2  /    4
 Rank   3  /    4
 Non-blocking
 1,1,1,1   5 -1092661536
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6 -1354461780
 2,1,1,1   6  130622
 
 Blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 

I have tried it with other MPI implementations (Intel MPI 19 and MPICH 
3.3), and they gave me the same output with the blocking and 
non-blocking calls:


coti@yyy:~$ mpiexec -n 4 ./test_allreduce
 Rank   0  /    4
 Rank   1  /    4
 Rank   2  /    4
 Rank   3  /    4
 Non-blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 
 Blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 

Is there anything wrong with my call to MPI_Iallreduce/MPI_Wait?

Thanks,
Camille


$ cat test_allreduce.f90
program main
  use mpi

  integer, allocatable, dimension(:,:,:,:,:) :: buff_in
  integer, allocatable, dimension(:,:,:,:) :: buff_out
  integer :: N, rank, size, err, i, j, k, l, m
  integer :: req

  N = 8

  allocate( buff_in( N, N, N, N, N ) )
  allocate( buff_out( N, N, N, N ) )

  call mpi_init( err )
  call mpi_comm_rank( mpi_comm_world, rank, err )
  call mpi_comm_size( mpi_comm_world, size, err )

  write( 6, * ) "Rank", rank, " / ", size

  do i=1, N
 do j=1, N
    do k=1, N
   do l=1, N
  do m=1, N
 buff_in( i, j, k, l, m ) = i + j + k + l + m
  end do
   end do
    end do
 end do
  end do

  buff_out( :,:,:,: ) = 0

! non-blocking

  call mpi_iallreduce( buff_in( 1, :, :, :, : ), buff_out, N*N*N*N, 
MPI_INT, MPI_SUM, mpi_comm_world, req, err )

  call mpi_wait( req, MPI_STATUS_IGNORE, err )

  if( 0 == rank ) then
 write( 6, * ) "Non-blocking"
 write( 6, * ) "1,1,1,1", buff_in( 1, 1, 1, 1, 1 ), buff_out( 1, 1, 
1, 1 )
 write( 6, * ) "1,1,1,2", buff_in( 1, 1, 1, 1, 2 ), buff_out( 1, 1, 
1, 2 )
 write( 6, * ) "1,1,1,3", buff_in( 1, 1, 1, 1, 3 ), buff_out( 1, 1, 
1, 3 )
 write( 6, * ) "1,1,1,4", buff_in( 1, 1, 1, 1, 4 ), buff_out( 1, 1, 
1, 4 )
 write( 6, * ) "1,1,1,5", buff_in( 1, 1, 1, 1, 5 ), buff_out( 1, 1, 
1, 5 )

 write( 6, * ) ""
 write( 6, * ) "1,1,2,1", buff_in( 1, 1, 1, 2, 1 ), buff_out( 1, 1, 
2, 1 )
 write( 6, * ) "1,2,1,1", buff_in( 1, 1, 2, 1, 1

Re: [OMPI users] MPI_Iallreduce with multidimensional Fortran array

2019-11-13 Thread Gilles Gouaillardet via users

Camille,


your program is only valid with an MPI library that features
MPI_SUBARRAYS_SUPPORTED,


and this is not (yet) the case in Open MPI.
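
(As a side note, a small sketch of how this constant can be inspected, assuming
an MPI-3 compliant mpi module that defines MPI_SUBARRAYS_SUPPORTED:

program check_subarrays
  use mpi
  implicit none
  integer :: err
  call mpi_init( err )
  ! .true. only if the library accepts non-contiguous subarray arguments
  ! directly in non-blocking calls
  write( 6, * ) "MPI_SUBARRAYS_SUPPORTED = ", MPI_SUBARRAYS_SUPPORTED
  call mpi_finalize( err )
end program check_subarrays

With Open MPI this should currently print F.)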


A possible fix is to use an intermediate contiguous buffer

  integer, allocatable, dimension(:,:,:,:) :: tmp
  allocate( tmp(N,N,N,N) )

and then replace

  call mpi_iallreduce( buff_in( 1, :, :, :, : ), buff_out, N*N*N*N, 
MPI_INT, MPI_SUM, mpi_comm_world, req, err )


with

  tmp = buff_in(1, :, :, :, :)
  call mpi_iallreduce( tmp, buff_out, N*N*N*N, MPI_INT, MPI_SUM, 
mpi_comm_world, req, err )



What currently happens with your program is that buff_in(1, :, :, :, :) 
is transparently copied into a contiguous buffer by the Fortran runtime,


and passed to MPI_Iallreduce. Then this temporary buffer is freed when the
call to MPI_Iallreduce returns, and hence **before** MPI_Wait() completes,


so the behavior of such a program is undefined.
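
(Putting the pieces together, a minimal self-contained sketch of the corrected
pattern; note it uses MPI_INTEGER, the Fortran datatype constant, in place of
MPI_INT from the original snippet:

program iallreduce_fixed
  use mpi
  implicit none
  integer, parameter :: N = 8
  integer, allocatable :: buff_in(:,:,:,:,:), buff_out(:,:,:,:), tmp(:,:,:,:)
  integer :: rank, err, req

  allocate( buff_in( N, N, N, N, N ) )
  allocate( buff_out( N, N, N, N ) )
  allocate( tmp( N, N, N, N ) )

  call mpi_init( err )
  call mpi_comm_rank( mpi_comm_world, rank, err )

  buff_in  = 1   ! placeholder initialization
  buff_out = 0

  ! copy the non-contiguous slice into a contiguous buffer that stays
  ! allocated (and untouched) until mpi_wait completes
  tmp = buff_in( 1, :, :, :, : )

  call mpi_iallreduce( tmp, buff_out, N*N*N*N, MPI_INTEGER, MPI_SUM, &
                       mpi_comm_world, req, err )
  call mpi_wait( req, MPI_STATUS_IGNORE, err )

  if ( rank == 0 ) write( 6, * ) "buff_out(1,1,1,1) =", buff_out( 1, 1, 1, 1 )

  call mpi_finalize( err )
end program iallreduce_fixed

Each rank contributes 1 per element, so with 4 ranks buff_out(1,1,1,1) should
print 4.)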



Cheers,


Gilles


On 11/14/2019 9:42 AM, Camille Coti via users wrote:

Dear all,

I have a little piece of code shown below that initializes a 
multidimensional Fortran array and performs:

- a non-blocking MPI_Iallreduce immediately followed by an MPI_Wait
- a blocking MPI_Allreduce
After both calls, it displays a few elements of the input and output 
buffers.


In the output I am showing below, the first column gives the indices 
of the element displayed, the second column gives the corresponding 
element in the input array, the third column gives the corresponding 
element in the output array. All the processes have the same input 
array so the output should just be a multiple of the input.


I tried to compile and execute it with OpenMPI 4.0.1 on a single node, 
I get:


coti@xxx:~$ mpiexec -n 4 test_allreduce
 Rank   3  /    4
 Rank   1  /    4
 Rank   0  /    4
 Rank   2  /    4
 Non-blocking
 1,1,1,1   5  1252991616
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6   21197
 
 Blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 

I just cloned the master branch of the Git repository and compiled it 
(hash db52da40c379610360676f225cd7c767e5a964d3), with the following 
configuration line:

  $ ./configure --prefix=<> --enable-mpi-fortran=usempi

I get:

coti@yyy:~$ mpiexec --mca btl vader,self -n 4 ./test_allreduce
 Rank   0  /    4
 Rank   1  /    4
 Rank   2  /    4
 Rank   3  /    4
 Non-blocking
 1,1,1,1   5 -1092661536
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6 -1354461780
 2,1,1,1   6  130622
 
 Blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 

I have tried it with other MPI implementations (Intel MPI 19 and MPICH 
3.3), and they gave me the same output with the blocking and 
non-blocking calls:


coti@yyy:~$ mpiexec -n 4 ./test_allreduce
 Rank   0  /    4
 Rank   1  /    4
 Rank   2  /    4
 Rank   3  /    4
 Non-blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 
 Blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 

Is there anything wrong with my call to MPI_Iallreduce/MPI_Wait?

Thanks,
Camille


$ cat test_allreduce.f90
program main
  use mpi

  integer, allocatable, dimension(:,:,:,:,:) :: buff_in
  integer, allocatable, dimension(:,:,:,:) :: buff_out
  integer :: N, rank, size, err, i, j, k, l, m
  integer :: req

  N = 8

  allocate( buff_in( N, N, N, N, N ) )
  allocate( buff_out( N, N, N, N ) )

  call mpi_init( err )
  call mpi_comm_rank( mpi_comm_world, rank, err )
  call mpi_comm_size( mpi_comm_world, size, err )

  write( 6, * ) "Rank", rank, " / ", size

  do i=1, N
 do j=1, N
    do k=1, N
   do l=1, N
  do m

Re: [OMPI users] MPI_Iallreduce with multidimensional Fortran array

2019-11-13 Thread Camille Coti via users

Dear Gilles,

Thank you very much for your clear answer.

Camille

On 11/13/19 5:40 PM, Gilles Gouaillardet via users wrote:

Camille,


your program is only valid with an MPI library that features
MPI_SUBARRAYS_SUPPORTED,


and this is not (yet) the case in Open MPI.


A possible fix is to use an intermediate contiguous buffer

  integer, allocatable, dimension(:,:,:,:) :: tmp
  allocate( tmp(N,N,N,N) )

and then replace

  call mpi_iallreduce( buff_in( 1, :, :, :, : ), buff_out, N*N*N*N, 
MPI_INT, MPI_SUM, mpi_comm_world, req, err )


with

  tmp = buff_in(1, :, :, :, :)
  call mpi_iallreduce( tmp, buff_out, N*N*N*N, MPI_INT, MPI_SUM, 
mpi_comm_world, req, err )



What currently happens with your program is that buff_in(1, :, :, :, 
:) is transparently copied into a contiguous buffer by the Fortran 
runtime,


and passed to MPI_Iallreduce. Then this temporary buffer is freed when the
call to MPI_Iallreduce returns, and hence **before** MPI_Wait() completes,


so the behavior of such a program is undefined.



Cheers,


Gilles


On 11/14/2019 9:42 AM, Camille Coti via users wrote:

Dear all,

I have a little piece of code shown below that initializes a 
multidimensional Fortran array and performs:

- a non-blocking MPI_Iallreduce immediately followed by an MPI_Wait
- a blocking MPI_Allreduce
After both calls, it displays a few elements of the input and output 
buffers.


In the output I am showing below, the first column gives the indices 
of the element displayed, the second column gives the corresponding 
element in the input array, the third column gives the corresponding 
element in the output array. All the processes have the same input 
array so the output should just be a multiple of the input.


I tried to compile and execute it with OpenMPI 4.0.1 on a single 
node, I get:


coti@xxx:~$ mpiexec -n 4 test_allreduce
 Rank   3  /    4
 Rank   1  /    4
 Rank   0  /    4
 Rank   2  /    4
 Non-blocking
 1,1,1,1   5  1252991616
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6   21197
 
 Blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 

I just cloned the master branch of the Git repository and compiled it 
(hash db52da40c379610360676f225cd7c767e5a964d3), with the following 
configuration line:

  $ ./configure --prefix=<> --enable-mpi-fortran=usempi

I get:

coti@yyy:~$ mpiexec --mca btl vader,self -n 4 ./test_allreduce
 Rank   0  /    4
 Rank   1  /    4
 Rank   2  /    4
 Rank   3  /    4
 Non-blocking
 1,1,1,1   5 -1092661536
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6 -1354461780
 2,1,1,1   6  130622
 
 Blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 

I have tried it with other MPI implementations (Intel MPI 19 and 
MPICH 3.3), and they gave me the same output with the blocking and 
non-blocking calls:


coti@yyy:~$ mpiexec -n 4 ./test_allreduce
 Rank   0  /    4
 Rank   1  /    4
 Rank   2  /    4
 Rank   3  /    4
 Non-blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 
 Blocking
 1,1,1,1   5  20
 1,1,1,2   6  24
 1,1,1,3   7  28
 1,1,1,4   8  32
 1,1,1,5   9  36
 
 1,1,2,1   6  24
 1,2,1,1   6  24
 2,1,1,1   6  24
 

Is there anything wrong with my call to MPI_Iallreduce/MPI_Wait?

Thanks,
Camille


$ cat test_allreduce.f90
program main
  use mpi

  integer, allocatable, dimension(:,:,:,:,:) :: buff_in
  integer, allocatable, dimension(:,:,:,:) :: buff_out
  integer :: N, rank, size, err, i, j, k, l, m
  integer :: req

  N = 8

  allocate( buff_in( N, N, N, N, N ) )
  allocate( buff_out( N, N, N, N ) )

  call mpi_init( err )
  call mpi_comm_rank( mpi_comm_world, rank, err )
  call mpi_comm_size( mpi_comm_world, size, err )


Re: [OMPI users] OpenMPI - Job pauses and goes no further

2019-11-13 Thread Ralph Castain via users
Difficult to know what to say here. I have no idea what your program does after 
validating the license. Does it execute some kind of MPI collective operation? 
Does only one proc validate the license and all others just use it?

All I can tell from your output is that the procs all launched okay.
Ralph


On Sep 27, 2019, at 4:32 PM, Steven Hill via users <users@lists.open-mpi.org> wrote:

Any assistance with this would be greatly appreciated. I’m running CentOS 7 
with Open MPI 1.10.7. We are using a product called XFlow by 3ds. I have been 
going back and forth trying to figure out why my OpenMPI job pauses when 
expanding across more than one machine.
 I confirmed the OpenMPI environment variable paths to libraries and bin files 
are correct on all machines (Head Node and 3 Compute Nodes).
LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:
PATH=/usr/lib64/openmpi/bin:
 I can run an MPI Job to display the host name.  
mpirun -host srv-comp01,srv-comp02,srv-comp03 hostname
srv-comp02
srv-comp01
srv-comp03
 If I run the command which normally pauses and I just identify the same 
hostname twice, it works fine
i.e. mpirun -npernode 2 -host srv-comp01, srv-comp02 {command}
 At the suggestion of the vendor I have tried “--mca btl tcp,self”; the 
job still pauses at the same spot.
 The firewall is turned off on all machines. Password-less SSH works without 
issue. I have tested this with another product we use called starccm (it has 
its own MPI provider).
 I have not run hello_c or ring_c. I see them referenced in the FAQ “11. How 
can I diagnose problems when running across multiple hosts?” but I can’t see 
where to download them from.
 Here is a verbose output of the command. It always pauses at “[ INFO  ] 
License validation OK” and goes no further. I am able to run the job without 
MPI on a single host. I’m not sure where to go from here.
 [symapp@srv-comp-hn ~]$ mpirun --version
mpirun (Open MPI) 1.10.7
 [symapp@srv-comp-hn ~]$ mpirun -npernode 1 --mca plm_base_verbose 10 -host 
srv-comp01,srv-comp02,srv-comp03 
/mntnfs/eng-nfs/Apps/XFlow/engine-3d-mpi-ompi10 
/mntnfs/eng-nfs/jsmith/XFlow/Periodic/PeriodicCavity_MPI3.xfp -maxcpu=1
[srv-comp-hn:04909] mca: base: components_register: registering plm components
[srv-comp-hn:04909] mca: base: components_register: found loaded component 
isolated
[srv-comp-hn:04909] mca: base: components_register: component isolated has no 
register or open function
[srv-comp-hn:04909] mca: base: components_register: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_register: component rsh register 
function successful
[srv-comp-hn:04909] mca: base: components_register: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_register: component slurm register 
function successful
[srv-comp-hn:04909] mca: base: components_open: opening plm components
[srv-comp-hn:04909] mca: base: components_open: found loaded component isolated
[srv-comp-hn:04909] mca: base: components_open: component isolated open 
function successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_open: component rsh open function 
successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_open: component slurm open function 
successful
[srv-comp-hn:04909] mca:base:select: Auto-selecting plm components
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [isolated]
[srv-comp-hn:04909] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [rsh]
[srv-comp-hn:04909] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [slurm]
[srv-comp-hn:04909] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[srv-comp-hn:04909] mca:base:select:(  plm) Selected component [rsh]
[srv-comp-hn:04909] mca: base: close: component isolated closed
[srv-comp-hn:04909] mca: base: close: unloading component isolated
[srv-comp-hn:04909] mca: base: close: component slurm closed
[srv-comp-hn:04909] mca: base: close: unloading component slurm
[srv-comp-hn:04909] [[15143,0],0] plm:rsh: final template argv:
    /usr/bin/ssh   orted --hnp-topo-sig 
0N:4S:4L3:4L2:4L1:8C:8H:x86_64 -mca ess "env" -mca orte_ess_jobid "992411648" 
-mca orte_ess_vpid "" -mca orte_ess_num_procs "4" -mca orte_hnp_uri 
"992411648.0;tcp://10.1.28.49,192.168.122.1:33405" --tree-spawn --mca 
plm_base_verbose "10" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" --tree-spawn
[srv-comp01:130272] mca: base: components_register: registering plm components
[srv-comp01:130272] mca: base: components_register: found loaded component rsh
[srv-comp01:130272] mca: base: components_register: component rsh register 
function successful
[srv-comp01:130272] mca: base: components_open: opening plm components
[srv-comp0

Re: [OMPI users] OpenMPI - Job pauses and goes no further

2019-11-13 Thread Jeff Squyres (jsquyres) via users
Agree with Ralph.  Your next step is to try what is suggested in the FAQ: run 
hello_c and ring_c.

They are in the examples/ directory in the source tarball.  Once Open MPI is 
installed (and things like "mpicc" can be found in your $PATH), you can just cd 
in there and run "make" to build them.
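
(Roughly, and assuming the Open MPI 1.10.7 source tree was unpacked in
~/openmpi-1.10.7 (that path is just an example), it would look like:

cd ~/openmpi-1.10.7/examples
make
mpirun -np 3 -host srv-comp01,srv-comp02,srv-comp03 ./hello_c
mpirun -np 3 -host srv-comp01,srv-comp02,srv-comp03 ./ring_c

If hello_c runs but ring_c hangs, that would suggest the problem is in MPI
point-to-point communication rather than in launching the processes.)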



On Nov 13, 2019, at 8:58 PM, Ralph Castain via users 
<users@lists.open-mpi.org> wrote:

Difficult to know what to say here. I have no idea what your program does after 
validating the license. Does it execute some kind of MPI collective operation? 
Does only one proc validate the license and all others just use it?

All I can tell from your output is that the procs all launched okay.
Ralph


On Sep 27, 2019, at 4:32 PM, Steven Hill via users 
<users@lists.open-mpi.org> wrote:

Any assistance with this would be greatly appreciated. I’m running CentOS 7 
with Open MPI 1.10.7. We are using a product called XFlow by 3ds. I have been 
going back and forth trying to figure out why my OpenMPI job pauses when 
expanding across more than one machine.

I confirmed the OpenMPI environment variable paths to libraries and bin files 
are correct on all machines (Head Node and 3 Compute Nodes).
LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:
PATH=/usr/lib64/openmpi/bin:

I can run an MPI Job to display the host name.
mpirun -host srv-comp01,srv-comp02,srv-comp03 hostname
srv-comp02
srv-comp01
srv-comp03

If I run the command which normally pauses and I just identify the same 
hostname twice, it works fine
i.e. mpirun -npernode 2 -host srv-comp01, srv-comp02 {command}

At the suggestion of the vendor I have tried “--mca btl tcp,self”; the 
job still pauses at the same spot.

The firewall is turned off on all machines. Password-less SSH works without 
issue. I have tested this with another product we use called starccm (it has 
its own MPI provider).

I have not run hello_c or ring_c. I see them referenced in the FAQ “11. How can 
I diagnose problems when running across multiple hosts?” but I can’t see where 
to download them from.

Here is a verbose output of the command. It always pauses at “[ INFO  ] License 
validation OK” and goes no further. I am able to run the job without MPI on a 
single host. I’m not sure where to go from here.

[symapp@srv-comp-hn ~]$ mpirun --version
mpirun (Open MPI) 1.10.7

[symapp@srv-comp-hn ~]$ mpirun -npernode 1 --mca plm_base_verbose 10 -host 
srv-comp01,srv-comp02,srv-comp03 
/mntnfs/eng-nfs/Apps/XFlow/engine-3d-mpi-ompi10 
/mntnfs/eng-nfs/jsmith/XFlow/Periodic/PeriodicCavity_MPI3.xfp -maxcpu=1
[srv-comp-hn:04909] mca: base: components_register: registering plm components
[srv-comp-hn:04909] mca: base: components_register: found loaded component 
isolated
[srv-comp-hn:04909] mca: base: components_register: component isolated has no 
register or open function
[srv-comp-hn:04909] mca: base: components_register: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_register: component rsh register 
function successful
[srv-comp-hn:04909] mca: base: components_register: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_register: component slurm register 
function successful
[srv-comp-hn:04909] mca: base: components_open: opening plm components
[srv-comp-hn:04909] mca: base: components_open: found loaded component isolated
[srv-comp-hn:04909] mca: base: components_open: component isolated open 
function successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_open: component rsh open function 
successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_open: component slurm open function 
successful
[srv-comp-hn:04909] mca:base:select: Auto-selecting plm components
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [isolated]
[srv-comp-hn:04909] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [rsh]
[srv-comp-hn:04909] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[srv-comp-hn:04909] mca:base:select:(  plm) Querying component [slurm]
[srv-comp-hn:04909] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[srv-comp-hn:04909] mca:base:select:(  plm) Selected component [rsh]
[srv-comp-hn:04909] mca: base: close: component isolated closed
[srv-comp-hn:04909] mca: base: close: unloading component isolated
[srv-comp-hn:04909] mca: base: close: component slurm closed
[srv-comp-hn:04909] mca: base: close: unloading component slurm
[srv-comp-hn:04909] [[15143,0],0] plm:rsh: final template argv:
/usr/bin/ssh   orted --hnp-topo-sig 
0N:4S:4L3:4L2:4L1:8C:8H:x86_64 -mca ess "env" -mca orte_ess_jobid "992411648" 
-mca orte_ess_vpid "" -mca orte_ess_num_procs "4" -mca orte_hnp_uri 
"992411648.0;tcp://10.1.28.49,192.168.122.1:33405" --tree-spawn --mca 
plm_base_verbose "1