Hi Greg,
I think the UCX PML may be tripping over the lack of thread safety.
Could you try using contrib/configure-release-mt in your ucx folder? You
want to add --enable-mt.
That's the difference that stands out between your configure output and the one
I usually get when building on a MLNX ConnectX-5 cluster with
MLNX_OFED_LINUX-4.5-1.0.1.0.
Here’s the output from one of my UCX configs:
configure: =========================================================
configure: UCX build configuration:
configure: Build prefix: <foobar>/ucx_testing/ucx/test_install
configure: Configuration dir: ${prefix}/etc/ucx
configure: Preprocessor flags: -DCPU_FLAGS="" -I${abs_top_srcdir}/src
-I${abs_top_builddir} -I${abs_top_builddir}/src
configure: C compiler:
/users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc
-O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers
-Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels
-Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
-Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length
-Wnested-externs -Wshadow -Werror=declaration-after-statement
configure: C++ compiler:
/users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++
-O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers
-Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels
-Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure: Multi-thread: enabled
configure: NUMA support: disabled
configure: MPI tests: disabled
configure: VFS support: no
configure: Devel headers: no
configure: io_demo CUDA support: no
configure: Bindings: < >
configure: UCS modules: < >
configure: UCT modules: < ib cma knem >
configure: CUDA modules: < >
configure: ROCM modules: < >
configure: IB modules: < >
configure: UCM modules: < >
configure: Perf modules: < >
configure: =========================================================
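To compare a build against the summary above, one option is to save the configure output and check the Multi-thread line. This is only a minimal sketch: the function name ucx_mt_enabled and the log file name are hypothetical and not part of UCX, and it assumes the summary line has the same wording as in the output above.

```shell
# Hypothetical helper (not part of UCX): succeeds when a saved UCX configure
# summary reports multi-thread support as enabled.
ucx_mt_enabled() {
    # Matches the summary line "configure: Multi-thread: enabled",
    # tolerating variable spacing after the colon.
    grep -Eq 'Multi-thread:[[:space:]]*enabled' "$1"
}

# Example, run from the UCX source tree (prefix and log name are assumptions):
#   ./contrib/configure-release-mt --prefix="$HOME/ucx-mt" 2>&1 | tee configure.log
#   ucx_mt_enabled configure.log && echo "MT build" || echo "not an MT build"
```

The helper only reads the saved log, so it can be run against an old build's output as well as a new one.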
Howard
From: "Fischer, Greg A." <[email protected]>
Date: Thursday, October 14, 2021 at 12:46 PM
To: "Pritchard Jr., Howard" <[email protected]>, Open MPI Users
<[email protected]>
Cc: "Fischer, Greg A." <[email protected]>
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0
errno says Success"
Thanks, Howard.
I downloaded a current version of UCX (1.11.2) and installed it with OpenMPI
4.1.1. When I try to specify the “-mca pml ucx” for a simple, 2-process
benchmark problem, I get:
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.
This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: bl1311
Framework: pml
--------------------------------------------------------------------------
[bl1311:20168] PML ucx cannot be selected
[bl1311:20169] PML ucx cannot be selected
------------------------------------------------------------
I’ve attached my ucx_info -d output, as well as the ucx configuration
information. I’m not sure I follow everything on the UCX FAQ page, but it seems
like everything is being routed over TCP, which is probably not what I want.
Any thoughts as to what I might be doing wrong?
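For reference, this is roughly how the transport list can be pulled out of the ucx_info -d output. It is a sketch under assumptions: the helper name ucx_transports is hypothetical, and it assumes each transport is reported on a line containing "Transport:" followed by its name, which may differ between UCX versions.

```shell
# Hypothetical helper: reads `ucx_info -d` output on stdin and prints each
# distinct transport name once, sorted.
ucx_transports() {
    # Transport lines are assumed to look like "#      Transport: tcp";
    # the transport name is the last whitespace-separated field.
    awk '/Transport:/ { print $NF }' | sort -u
}

# Example:
#   ucx_info -d | ucx_transports
```

If only tcp and shared-memory style transports show up and nothing InfiniBand-related does, that would be consistent with traffic going over TCP.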
Thanks,
Greg
From: Pritchard Jr., Howard <[email protected]>
Sent: Wednesday, October 13, 2021 12:28 PM
To: Open MPI Users <[email protected]>
Cc: Fischer, Greg A. <[email protected]>
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0
errno says Success"
Hi Greg,
It’s the aging of the openib btl.
You may be able to apply the attached patch. Note the 3.1.x release stream is
no longer supported.
You may want to try using the 4.1.1 release, in which case you’ll want to use
UCX.
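After building 4.1.1, it may be worth confirming that the UCX PML component was actually built. A minimal sketch, assuming ompi_info lists components in its usual "MCA pml: ucx (...)" form; the function name ompi_has_pml_ucx is hypothetical:

```shell
# Hypothetical helper: reads `ompi_info` output on stdin and succeeds when a
# UCX PML component is listed.
ompi_has_pml_ucx() {
    # Component lines are assumed to look like
    # "MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.1)".
    grep -Eq 'MCA pml:[[:space:]]*ucx([[:space:]]|$)'
}

# Example:
#   ompi_info | ompi_has_pml_ucx && echo "pml ucx present" || echo "pml ucx missing"
```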
Howard
From: users <[email protected]> on behalf of "Fischer, Greg A. via users"
<[email protected]>
Reply-To: Open MPI Users <[email protected]>
Date: Wednesday, October 13, 2021 at 10:06 AM
To: [email protected]
Cc: "Fischer, Greg A." <[email protected]>
Subject: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno
says Success"
Hello,
I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the
following errors when I try to use the openib btl:
WARNING: There was an error initializing an OpenFabrics device.
Local host: bl1308
Local device: mlx4_0
--------------------------------------------------------------------------
[bl1308][[44866,1],5][../../../../../openmpi-3.1.6/opal/mca/btl/openib/btl_openib_component.c:1671:init_one_device]
error obtaining device attributes for mlx4_0 errno says Success
I have disabled UCX ("--without-ucx") because the UCX installation we have
seems to be too out-of-date. ofed_info says "MLNX_OFED_LINUX-4.1-1.0.2.0". I've
attached the detailed output of ofed_info and ompi_info.
This issue seems similar to Issue #7461
(https://github.com/open-mpi/ompi/issues/7461),
which I don't see a resolution for.
Does anyone know what the likely explanation is? Is the version of OFED on the
system badly out-of-sync with contemporary OpenMPI?
Thanks,
Greg
________________________________
This e-mail may contain proprietary information of the sending organization.
Any unauthorized or improper disclosure, copying, distribution, or use of the
contents of this e-mail and attached document(s) is prohibited. The information
contained in this e-mail and attached document(s) is intended only for the
personal and private use of the recipient(s) named above. If you have received
this communication in error, please notify the sender immediately by email and
delete the original e-mail and attached document(s).