I tried the patch, but I get the same result: error obtaining device attributes for mlx4_0 errno says Success
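A note on the wording of that message: "errno says Success" is the text that a strerror-style lookup returns for error code 0 on Linux. One plausible reading, offered as an assumption rather than a confirmed diagnosis, is that the failing call reported its error through its return value while errno itself was never set, so the log printed the text for 0. The Python sketch below only illustrates that pattern; the function names and the ENOSYS return code are hypothetical and are not taken from the Open MPI source.

```python
import errno
import os


def misleading_message(device: str) -> str:
    """Format the error from an untouched errno (assumed to be 0 here)."""
    # When errno was never set, the lookup describes code 0, not the failure.
    return (
        f"error obtaining device attributes for {device} "
        f"errno says {os.strerror(0)}"
    )


def informative_message(device: str, rc: int) -> str:
    """Format the error from the call's own return code instead."""
    return (
        f"error obtaining device attributes for {device}: "
        f"{os.strerror(rc)} (rc={rc})"
    )


if __name__ == "__main__":
    # Hypothetical return code, chosen only for illustration.
    rc = errno.ENOSYS
    print(misleading_message("mlx4_0"))
    print(informative_message("mlx4_0", rc))
```

If that reading applies, the real failure reason would be in the value the query returned, which the log line above does not show.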
I'm getting (what I think are) good transfer rates using "--mca btl self,tcp" on the osu_bw test (~7000 MB/s). It seems to me that the only way that could be happening is if the InfiniBand interfaces are being used over TCP, correct? Would such an arrangement preclude the ability to do RDMA or openib? Perhaps the network is set up in such a way that the IB hardware is not discoverable by openib? (I'm not a network admin, and I wasn't involved in the setup of the network. Unfortunately, the person who knows the most has recently left the organization.)

Greg

From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Thursday, October 14, 2021 5:45 PM
To: Fischer, Greg A. <fisch...@westinghouse.com>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

[External Email]

Hi Greg,

Oh yes, that's not good about rdmacm. Yes, the OFED looks pretty old.

Did you by any chance apply that patch? I generated that for a sysadmin here who needed to maintain Open MPI 3.1.6 but also had to upgrade to a newer RHEL release, but Open MPI wasn't compiling after the RHEL upgrade.

Howard

From: "Fischer, Greg A." <fisch...@westinghouse.com>
Date: Thursday, October 14, 2021 at 1:47 PM
To: "Pritchard Jr., Howard" <howa...@lanl.gov>, Open MPI Users <users@lists.open-mpi.org>
Cc: "Fischer, Greg A." <fisch...@westinghouse.com>
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

I added --enable-mt and re-installed UCX. Same result. (I didn't re-compile OpenMPI.)

A conspicuous warning I see in my UCX configure output is:

checking for rdma_establish in -lrdmacm...
no
configure: WARNING: RDMACM requested but librdmacm is not found or does not provide rdma_establish() API

The version of librdmacm we have comes from librdmacm-devel-41mlnx1-OFED.4.1.0.1.0.41102.x86_64, which seems to date from mid-2017. I wonder if that's too old?

Greg

From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Thursday, October 14, 2021 3:31 PM
To: Fischer, Greg A. <fisch...@westinghouse.com>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

[External Email]

Hi Greg,

I think the UCX PML may be discomfited by the lack of thread safety.

Could you try using the contrib/configure-release-mt in your ucx folder? You want to add --enable-mt. That's what stands out in your configure output compared with the one I usually get when building on a MLNX ConnectX-5 cluster with MLNX_OFED_LINUX-4.5-1.0.1.0.

Here's the output from one of my UCX configs:

configure: =========================================================
configure: UCX build configuration:
configure: Build prefix: <foobar>/ucx_testing/ucx/test_install
configure: Configuration dir: ${prefix}/etc/ucx
configure: Preprocessor flags: -DCPU_FLAGS="" -I${abs_top_srcdir}/src -I${abs_top_builddir} -I${abs_top_builddir}/src
configure: C compiler: /users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/gcc -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch -Wno-pointer-sign -Werror-implicit-function-declaration -Wno-format-zero-length -Wnested-externs -Wshadow -Werror=declaration-after-statement
configure: C++ compiler:
/users/hpritchard/spack/opt/spack/linux-rhel7-aarch64/gcc-4.8.5/gcc-9.1.0-nhd4fe4i6jtn2hncfzumegojm6hsznxy/bin/g++ -O3 -g -Wall -Werror -funwind-tables -Wno-missing-field-initializers -Wno-unused-parameter -Wno-unused-label -Wno-long-long -Wno-endif-labels -Wno-sign-compare -Wno-multichar -Wno-deprecated-declarations -Winvalid-pch
configure: Multi-thread: enabled
configure: NUMA support: disabled
configure: MPI tests: disabled
configure: VFS support: no
configure: Devel headers: no
configure: io_demo CUDA support: no
configure: Bindings: < >
configure: UCS modules: < >
configure: UCT modules: < ib cma knem >
configure: CUDA modules: < >
configure: ROCM modules: < >
configure: IB modules: < >
configure: UCM modules: < >
configure: Perf modules: < >
configure: =========================================================

Howard

From: "Fischer, Greg A." <fisch...@westinghouse.com>
Date: Thursday, October 14, 2021 at 12:46 PM
To: "Pritchard Jr., Howard" <howa...@lanl.gov>, Open MPI Users <users@lists.open-mpi.org>
Cc: "Fischer, Greg A." <fisch...@westinghouse.com>
Subject: RE: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Thanks, Howard. I downloaded a current version of UCX (1.11.2) and installed it with OpenMPI 4.1.1. When I try to specify the "-mca pml ucx" for a simple, 2-process benchmark problem, I get:

--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.
Host: bl1311
Framework: pml
--------------------------------------------------------------------------
[bl1311:20168] PML ucx cannot be selected
[bl1311:20169] PML ucx cannot be selected

I've attached my ucx_info -d output, as well as the ucx configuration information. I'm not sure I follow everything on the UCX FAQ page, but it seems like everything is being routed over TCP, which is probably not what I want.

Any thoughts as to what I might be doing wrong?

Thanks,
Greg

From: Pritchard Jr., Howard <howa...@lanl.gov>
Sent: Wednesday, October 13, 2021 12:28 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Fischer, Greg A. <fisch...@westinghouse.com>
Subject: Re: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

[External Email]

Hi Greg,

It's the aging of the openib btl.

You may be able to apply the attached patch. Note the 3.1.x release stream is no longer supported.

You may want to try using the 4.1.1 release, in which case you'll want to use UCX.

Howard

From: users <users-boun...@lists.open-mpi.org> on behalf of "Fischer, Greg A. via users" <users@lists.open-mpi.org>
Reply-To: Open MPI Users <users@lists.open-mpi.org>
Date: Wednesday, October 13, 2021 at 10:06 AM
To: "users@lists.open-mpi.org" <users@lists.open-mpi.org>
Cc: "Fischer, Greg A." <fisch...@westinghouse.com>
Subject: [EXTERNAL] [OMPI users] OpenMPI 3.1.6 openib failure: "mlx4_0 errno says Success"

Hello,

I have compiled OpenMPI 3.1.6 from source on SLES12-SP3, and I am seeing the following errors when I try to use the openib btl:

WARNING: There was an error initializing an OpenFabrics device.
Local host: bl1308
Local device: mlx4_0
--------------------------------------------------------------------------
[bl1308][[44866,1],5][../../../../../openmpi-3.1.6/opal/mca/btl/openib/btl_openib_component.c:1671:init_one_device] error obtaining device attributes for mlx4_0 errno says Success

I have disabled UCX ("--without-ucx") because the UCX installation we have seems to be too out-of-date. ofed_info says "MLNX_OFED_LINUX-4.1-1.0.2.0". I've attached the detailed output of ofed_info and ompi_info.

This issue seems similar to Issue #7461 (https://github.com/open-mpi/ompi/issues/7461), which I don't see a resolution for. Does anyone know what the likely explanation is? Is the version of OFED on the system badly out-of-sync with contemporary OpenMPI?

Thanks,
Greg

________________________________

This e-mail may contain proprietary information of the sending organization. Any unauthorized or improper disclosure, copying, distribution, or use of the contents of this e-mail and attached document(s) is prohibited. The information contained in this e-mail and attached document(s) is intended only for the personal and private use of the recipient(s) named above. If you have received this communication in error, please notify the sender immediately by email and delete the original e-mail and attached document(s).