Re: [OMPI users] Change behavior of --output-filename
Ok, thank you for the GitHub links. I missed those. But the question remains whether the old functionality in ./orte/orted/orted_main.c is still accessible via some configuration parameter. I will also follow up on the GitHub issues.
Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.
Hello Matteo,

What version of openmpi are you running?

Also, the OFED-4.17-1 release notes do not claim support for CentOS 7.7; it supports CentOS 7.6. Apologies if you have already tried CentOS 7.6.

We have been able to run openmpi (earlier this month):

OS: CentOS 7.6
mpirun --version: 3.1.4
ofed_info -s: OFED-4.17-1
RNIC fw version: 8.50.9.0

Thanks.
--
Llolsten

-----Original Message-----
From: users On Behalf Of Matteo Guglielmi via users
Sent: Wednesday, November 13, 2019 2:12 AM
To: users@lists.open-mpi.org
Cc: Matteo Guglielmi
Subject: [OMPI users] qelr_alloc_context: Failed to allocate context for device.

I'm trying to get openmpi over RoCE working with this setup:

card: https://www.gigabyte.com/Accessory/CLNOQ42-rev-10#ov
OS: CentOS 7.7

modinfo qede

filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/net/ethernet/qlogic/qede/qede.ko.xz
version:        8.37.0.20
license:        GPL
description:    QLogic FastLinQ 4 Ethernet Driver
retpoline:      Y
rhelversion:    7.7
srcversion:     A6AFD0788918644F2EFFF31
alias:          pci:v1077d8090sv*sd*bc*sc*i*
alias:          pci:v1077d8070sv*sd*bc*sc*i*
alias:          pci:v1077d1664sv*sd*bc*sc*i*
alias:          pci:v1077d1656sv*sd*bc*sc*i*
alias:          pci:v1077d1654sv*sd*bc*sc*i*
alias:          pci:v1077d1644sv*sd*bc*sc*i*
alias:          pci:v1077d1636sv*sd*bc*sc*i*
alias:          pci:v1077d1666sv*sd*bc*sc*i*
alias:          pci:v1077d1634sv*sd*bc*sc*i*
depends:        ptp,qed
intree:         Y
vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256
parm:           debug: Default debug msglevel (uint)

modinfo qedr

filename:       /lib/modules/3.10.0-1062.4.1.el7.x86_64/kernel/drivers/infiniband/hw/qedr/qedr.ko.xz
license:        Dual BSD/GPL
author:         QLogic Corporation
description:    QLogic 40G/100G ROCE Driver
retpoline:      Y
rhelversion:    7.7
srcversion:     B5B65473217AA2B1F2F619B
depends:        qede,qed,ib_core
intree:         Y
vermagic:       3.10.0-1062.4.1.el7.x86_64 SMP mod_unload modversions
signer:         CentOS Linux kernel signing key
sig_key:        60:48:F2:5B:83:1E:C4:47:02:00:E2:36:02:C5:CA:83:1D:18:CF:8F
sig_hashalgo:   sha256

ibv_devinfo

hca_id: qedr0
        transport:              InfiniBand (0)
        fw_ver:                 8.37.7.0
        node_guid:              b62e:99ff:fea7:8439
        sys_image_guid:         b62e:99ff:fea7:8439
        vendor_id:              0x1077
        vendor_part_id:         32880
        hw_ver:                 0x0
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     1024 (3)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet

hca_id: qedr1
        transport:              InfiniBand (0)
        fw_ver:                 8.37.7.0
        node_guid:              b62e:99ff:fea7:843a
        sys_image_guid:         b62e:99ff:fea7:843a
        vendor_id:              0x1077
        vendor_part_id:         32880
        hw_ver:                 0x0
        phys_port_cnt:          1
                port:   1
                        state:          PORT_DOWN (1)
                        max_mtu:        4096 (5)
                        active_mtu:     1024 (3)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet

RDMA actually works at the system level, which means that I can do rdma ping-pong tests etc.

But when I try to run openmpi with these options:

mpirun --mca btl openib,self,vader --mca btl_openib_cpc_include rdmacm ...

I get the following error messages:

--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found, but
there are no active ports detected (or Open MPI was unable to use them).
This is most certainly not what you wanted. Check your cables, subnet
manager configuration, etc. The openib BTL will be ignored for this job.

  Local host: node001
--------------------------------------------------------------------------
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be used
on a specific port. As such, the openib BTL (OpenFabrics support) will be
disabled for this port.

  Local host:     node002
  Local device:   qedr0
  Local port:     1
  CPCs attempted: rdmacm
--------------------------------------------------------------------------
qelr_alloc_context: Failed to allocate context for device.
qelr_alloc_context: Failed to allocate context for device.
...
I've tried several things, such as:

1) upgrading the 3.10 kernel's qed* drivers to the latest stable version 8.42.9
2) upgrading the CentOS kernel from 3.10 to 5.3 via elrepo
3) installing the latest OFED-4.17-1.tgz stack

but the error messages never go away and always remain the same.
Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.
I rolled everything back to stock CentOS 7.7, installing OFED via:

yum groupinstall @infiniband
yum install rdma-core-devel infiniband-diags-devel

which does not install the ofed_info command, or at least I could not find it (do you know where it is?).

openmpi is version 3.1.4, and the fw version should be 8.37.7.0.

I will now try to upgrade the firmware, since changing OS is not an option.

Other suggestions?

Thank you!
Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.
Have you tried using the UCX PML? The UCX PML is Mellanox's preferred Open MPI mechanism (instead of the openib BTL).
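Jeff's suggestion above can be sketched as a command line. This is a hedged example, not a verified recipe: `./my_app` and `-n 4` are placeholders, and it assumes your Open MPI build was configured with UCX support.

```shell
# Select the UCX PML instead of the openib BTL (assumes a UCX-enabled build;
# ./my_app and -n 4 are placeholders for your application and process count).
mpirun --mca pml ucx -n 4 ./my_app

# To check whether this Open MPI build has a UCX component at all:
ompi_info | grep -i ucx
```

If `ompi_info` shows no UCX component, Open MPI would need to be rebuilt with `--with-ucx` for this to work.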
Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.
I'm not using Mellanox OFED because the card is a Marvell OCP-type 25Gb/s 2-port LAN card. The kernel drivers used are qede + qedr.

Besides that, I did a quick test on two nodes, installing CentOS 7.6 and:

ofed_info -s
OFED-4.17-1:

and now the error message is different:

--------------------------------------------------------------------------
[[30578,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: node001

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
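The NOTE in Matteo's latest error output mentions the MCA parameter btl_base_warn_component_unused. For reference, MCA parameters can also be set persistently in a parameters file instead of on every mpirun command line. A hedged sketch; the per-user path below is the standard location, but verify it against your installation:

```ini
# $HOME/.openmpi/mca-params.conf
# (system-wide equivalent: <prefix>/etc/openmpi-mca-params.conf)

# Silence the "component unused" warning mentioned in the NOTE above:
btl_base_warn_component_unused = 0
```

Parameters given on the command line with --mca take precedence over values from these files.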
Re: [OMPI users] qelr_alloc_context: Failed to allocate context for device.
I cannot find a firmware for my card: https://www.gigabyte.com/za/Accessory/CLNOQ42-rev-10#ov

Do you have the same model?

I found this zip file on the web: Linux_FWupg_41xxx_2.10.78.zip, which contains a firmware upgrade tool and a firmware version 8.50.83, but when I run it I get this error message (the card is not supported):

./LnxQlgcUpg.sh
Extracting package contents...
QLogic Firmware Upgrade Utility for Linux: v2.10.78
NIC is not supported. Quitting program ...
Program Exit Code: (16) Failed to upgraded MBI

Thank you.
[OMPI users] MPI_Iallreduce with multidimensional Fortran array
Dear all,

I have a little piece of code, shown below, that initializes a multidimensional Fortran array and performs:

- a non-blocking MPI_Iallreduce immediately followed by an MPI_Wait
- a blocking MPI_Allreduce

After both calls, it displays a few elements of the input and output buffers. In the output I am showing below, the first column gives the indices of the element displayed, the second column gives the corresponding element of the input array, and the third column gives the corresponding element of the output array. All the processes have the same input array, so each output element should just be a multiple of the corresponding input element.

I tried to compile and execute it with OpenMPI 4.0.1 on a single node, and I get:

coti@xxx:~$ mpiexec -n 4 test_allreduce
Rank 3 / 4
Rank 1 / 4
Rank 0 / 4
Rank 2 / 4
Non-blocking
1,1,1,1   5   1252991616
1,1,1,2   6   24
1,1,1,3   7   28
1,1,1,4   8   32
1,1,1,5   9   36
1,1,2,1   6   24
1,2,1,1   6   24
2,1,1,1   6   21197
Blocking
1,1,1,1   5   20
1,1,1,2   6   24
1,1,1,3   7   28
1,1,1,4   8   32
1,1,1,5   9   36
1,1,2,1   6   24
1,2,1,1   6   24
2,1,1,1   6   24

I just cloned the master branch of the Git repository and compiled it (hash db52da40c379610360676f225cd7c767e5a964d3) with the following configuration line:

$ ./configure --prefix=<> --enable-mpi-fortran=usempi

I get:

coti@yyy:~$ mpiexec --mca btl vader,self -n 4 ./test_allreduce
Rank 0 / 4
Rank 1 / 4
Rank 2 / 4
Rank 3 / 4
Non-blocking
1,1,1,1   5   -1092661536
1,1,1,2   6   24
1,1,1,3   7   28
1,1,1,4   8   32
1,1,1,5   9   36
1,1,2,1   6   24
1,2,1,1   6   -1354461780
2,1,1,1   6   130622
Blocking
1,1,1,1   5   20
1,1,1,2   6   24
1,1,1,3   7   28
1,1,1,4   8   32
1,1,1,5   9   36
1,1,2,1   6   24
1,2,1,1   6   24
2,1,1,1   6   24

I have tried it with other MPI implementations (Intel MPI 19 and MPICH 3.3), and they gave me the same output for the blocking and non-blocking calls:

coti@yyy:~$ mpiexec -n 4 ./test_allreduce
Rank 0 / 4
Rank 1 / 4
Rank 2 / 4
Rank 3 / 4
Non-blocking
1,1,1,1   5   20
1,1,1,2   6   24
1,1,1,3   7   28
1,1,1,4   8   32
1,1,1,5   9   36
1,1,2,1   6   24
1,2,1,1   6   24
2,1,1,1   6   24
Blocking
1,1,1,1   5   20
1,1,1,2   6   24
1,1,1,3   7   28
1,1,1,4   8   32
1,1,1,5   9   36
1,1,2,1   6   24
1,2,1,1   6   24
2,1,1,1   6   24

Is there anything wrong with my call to MPI_Iallreduce/MPI_Wait?

Thanks,
Camille

$ cat test_allreduce.f90
program main
  use mpi

  integer, allocatable, dimension(:,:,:,:,:) :: buff_in
  integer, allocatable, dimension(:,:,:,:)   :: buff_out
  integer :: N, rank, size, err, i, j, k, l, m
  integer :: req

  N = 8
  allocate( buff_in( N, N, N, N, N ) )
  allocate( buff_out( N, N, N, N ) )

  call mpi_init( err )
  call mpi_comm_rank( mpi_comm_world, rank, err )
  call mpi_comm_size( mpi_comm_world, size, err )

  write( 6, * ) "Rank", rank, " / ", size

  do i=1, N
    do j=1, N
      do k=1, N
        do l=1, N
          do m=1, N
            buff_in( i, j, k, l, m ) = i + j + k + l + m
          end do
        end do
      end do
    end do
  end do

  buff_out( :,:,:,: ) = 0

  ! non-blocking
  call mpi_iallreduce( buff_in( 1, :, :, :, : ), buff_out, N*N*N*N, MPI_INT, MPI_SUM, mpi_comm_world, req, err )
  call mpi_wait( req, MPI_STATUS_IGNORE, err )

  if( 0 == rank ) then
    write( 6, * ) "Non-blocking"
    write( 6, * ) "1,1,1,1", buff_in( 1, 1, 1, 1, 1 ), buff_out( 1, 1, 1, 1 )
    write( 6, * ) "1,1,1,2", buff_in( 1, 1, 1, 1, 2 ), buff_out( 1, 1, 1, 2 )
    write( 6, * ) "1,1,1,3", buff_in( 1, 1, 1, 1, 3 ), buff_out( 1, 1, 1, 3 )
    write( 6, * ) "1,1,1,4", buff_in( 1, 1, 1, 1, 4 ), buff_out( 1, 1, 1, 4 )
    write( 6, * ) "1,1,1,5", buff_in( 1, 1, 1, 1, 5 ), buff_out( 1, 1, 1, 5 )
    write( 6, * ) ""
    write( 6, * ) "1,1,2,1", buff_in( 1, 1, 1, 2, 1 ), buff_out( 1, 1, 2, 1 )
    write( 6, * ) "1,2,1,1", buff_in( 1, 1, 2, 1, 1
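Independent of MPI, the values this reduction should produce can be sanity-checked in plain Python: every rank fills buff_in(i,j,k,l,m) = i+j+k+l+m, and the reduction sums buff_in(1,:,:,:,:) over the ranks, so with 4 processes buff_out(j,k,l,m) should be 4*(1+j+k+l+m). A minimal sketch (nprocs = 4 matches the mpiexec -n 4 runs above):

```python
# Expected MPI_SUM allreduce results for the program above, computed without
# MPI: each of the nprocs ranks contributes the same input element
# i+j+k+l+m (with the first index fixed at 1), so the reduced value is just
# nprocs times the input element.
nprocs = 4

def input_elem(j, k, l, m):
    """buff_in(1, j, k, l, m), using 1-based Fortran indices."""
    return 1 + j + k + l + m

def expected_out(j, k, l, m):
    """buff_out(j, k, l, m) after an MPI_SUM allreduce over nprocs ranks."""
    return nprocs * input_elem(j, k, l, m)

# These match the "Blocking" column of the output shown above:
print(expected_out(1, 1, 1, 1))  # 20
print(expected_out(1, 1, 1, 2))  # 24
print(expected_out(1, 1, 1, 5))  # 36
print(expected_out(2, 1, 1, 1))  # 24
```

The garbage values in the "Non-blocking" column (e.g. 1252991616 instead of 20) are what suggest a buffer-lifetime problem rather than a wrong reduction.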
Re: [OMPI users] MPI_Iallreduce with multidimensional Fortran array
Camille,

your program is only valid with an MPI library that features MPI_SUBARRAYS_SUPPORTED, and this is not (yet) the case in Open MPI.

A possible fix is to use an intermediate contiguous buffer:

integer, allocatable, dimension(:,:,:,:) :: tmp

allocate( tmp(N,N,N,N) )

and then replace

call mpi_iallreduce( buff_in( 1, :, :, :, : ), buff_out, N*N*N*N, MPI_INT, MPI_SUM, mpi_comm_world, req, err )

with

tmp = buff_in(1, :, :, :, :)
call mpi_iallreduce( tmp, buff_out, N*N*N*N, MPI_INT, MPI_SUM, mpi_comm_world, req, err )

What currently happens with your program is that buff_in(1, :, :, :, :) is transparently copied into a contiguous buffer by the Fortran runtime and passed to MPI_Iallreduce. This temporary buffer is then freed when the call to MPI_Iallreduce returns, hence **before** MPI_Wait() completes, so the behavior of such a program is undefined.

Cheers,

Gilles
Re: [OMPI users] MPI_Iallreduce with multidimensional Fortran array
Dear Gilles,

Thank you very much for your clear answer.

Camille
Re: [OMPI users] OpenMPI - Job pauses and goes no further
Difficult to know what to say here. I have no idea what your program does after validating the license. Does it execute some kind of MPI collective operation? Does only one proc validate the license and all others just use it? All I can tell from your output is that the procs all launched okay.
Ralph

On Sep 27, 2019, at 4:32 PM, Steven Hill via users <users@lists.open-mpi.org> wrote:

Any assistance with this would be greatly appreciated. I’m running CentOS 7 with Open MPI 1.10.7. We are using a product called XFlow by 3ds. I have been going back and forth trying to figure out why my Open MPI job pauses when expanding across more than one machine.

I confirmed that the Open MPI environment variable paths to libraries and bin files are correct on all machines (head node and 3 compute nodes):

LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:
PATH=/usr/lib64/openmpi/bin:

I can run an MPI job to display the host names:

mpirun -host srv-comp01,srv-comp02,srv-comp03 hostname
srv-comp02
srv-comp01
srv-comp03

If I run the command which normally pauses but just identify the same hostname twice, it works fine, i.e.:

mpirun -npernode 2 -host srv-comp01,srv-comp02 {command}

At the suggestion of the vendor I have tried “--mca btl tcp,self”; the job still pauses at the same spot. The firewall is turned off on all machines. Password-less SSH works without issue. I have tested this with another product we use called starccm (it has its own MPI provider).

I have not run hello_c or ring_c; I see them referenced in the FAQ “11. How can I diagnose problems when running across multiple hosts?” but I can’t see where to download them from.

Here is a verbose output of the command. It always pauses at “[ INFO ] License validation OK” and goes no further. I am able to run the job without MPI on a single host. I’m not sure where to go from here.
[symapp@srv-comp-hn ~]$ mpirun --version
mpirun (Open MPI) 1.10.7
[symapp@srv-comp-hn ~]$ mpirun -npernode 1 --mca plm_base_verbose 10 -host srv-comp01,srv-comp02,srv-comp03 /mntnfs/eng-nfs/Apps/XFlow/engine-3d-mpi-ompi10 /mntnfs/eng-nfs/jsmith/XFlow/Periodic/PeriodicCavity_MPI3.xfp -maxcpu=1
[srv-comp-hn:04909] mca: base: components_register: registering plm components
[srv-comp-hn:04909] mca: base: components_register: found loaded component isolated
[srv-comp-hn:04909] mca: base: components_register: component isolated has no register or open function
[srv-comp-hn:04909] mca: base: components_register: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_register: component rsh register function successful
[srv-comp-hn:04909] mca: base: components_register: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_register: component slurm register function successful
[srv-comp-hn:04909] mca: base: components_open: opening plm components
[srv-comp-hn:04909] mca: base: components_open: found loaded component isolated
[srv-comp-hn:04909] mca: base: components_open: component isolated open function successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component rsh
[srv-comp-hn:04909] mca: base: components_open: component rsh open function successful
[srv-comp-hn:04909] mca: base: components_open: found loaded component slurm
[srv-comp-hn:04909] mca: base: components_open: component slurm open function successful
[srv-comp-hn:04909] mca:base:select: Auto-selecting plm components
[srv-comp-hn:04909] mca:base:select:( plm) Querying component [isolated]
[srv-comp-hn:04909] mca:base:select:( plm) Query of component [isolated] set priority to 0
[srv-comp-hn:04909] mca:base:select:( plm) Querying component [rsh]
[srv-comp-hn:04909] mca:base:select:( plm) Query of component [rsh] set priority to 10
[srv-comp-hn:04909] mca:base:select:( plm) Querying component [slurm]
[srv-comp-hn:04909] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[srv-comp-hn:04909] mca:base:select:( plm) Selected component [rsh]
[srv-comp-hn:04909] mca: base: close: component isolated closed
[srv-comp-hn:04909] mca: base: close: unloading component isolated
[srv-comp-hn:04909] mca: base: close: component slurm closed
[srv-comp-hn:04909] mca: base: close: unloading component slurm
[srv-comp-hn:04909] [[15143,0],0] plm:rsh: final template argv: /usr/bin/ssh orted --hnp-topo-sig 0N:4S:4L3:4L2:4L1:8C:8H:x86_64 -mca ess "env" -mca orte_ess_jobid "992411648" -mca orte_ess_vpid "" -mca orte_ess_num_procs "4" -mca orte_hnp_uri "992411648.0;tcp://10.1.28.49,192.168.122.1:33405" --tree-spawn --mca plm_base_verbose "10" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" --tree-spawn
[srv-comp01:130272] mca: base: components_register: registering plm components
[srv-comp01:130272] mca: base: components_register: found loaded component rsh
[srv-comp01:130272] mca: base: components_register: component rsh register function successful
[srv-comp01:130272] mca: base: components_open: opening plm components
[srv-comp0
Re: [OMPI users] OpenMPI - Job pauses and goes no further
Agree with Ralph. Your next step is to try what is suggested in the FAQ: run hello_c and ring_c. They are in the examples/ directory in the source tarball. Once Open MPI is installed (and things like "mpicc" can be found in your $PATH), you can just cd in there and run "make" to build them.

On Nov 13, 2019, at 8:58 PM, Ralph Castain via users <users@lists.open-mpi.org> wrote:

Difficult to know what to say here. I have no idea what your program does after validating the license. Does it execute some kind of MPI collective operation? Does only one proc validate the license and all others just use it? All I can tell from your output is that the procs all launched okay.
Ralph
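The hello_c/ring_c diagnostic suggested above can be sketched as a few commands. This is a sketch, not part of the thread: OMPI_SRC is an assumed path (point it at wherever the openmpi-1.10.7 source tarball was unpacked), the host names are taken from Steven's post, and a guard prints a hint instead of failing when Open MPI is not available:

```shell
# Sketch: build and run the example programs shipped in the Open MPI
# source tarball.  OMPI_SRC is an assumed location; adjust as needed.
OMPI_SRC=${OMPI_SRC:-$HOME/openmpi-1.10.7}
if command -v mpicc >/dev/null 2>&1 && [ -d "$OMPI_SRC/examples" ]; then
    cd "$OMPI_SRC/examples"
    make hello_c ring_c
    # Host names below are from the original post; substitute your own.
    mpirun -npernode 1 -host srv-comp01,srv-comp02,srv-comp03 ./hello_c
    mpirun -npernode 1 -host srv-comp01,srv-comp02,srv-comp03 ./ring_c
else
    echo "mpicc or $OMPI_SRC/examples not found; install Open MPI and unpack the source tarball first"
fi
```

If hello_c and ring_c run cleanly across all three hosts, the launch and communication layers are fine and the hang is in the application; if they hang too, the problem is in the Open MPI setup itself.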