Re: [OMPI users] UCX errors after upgrade

Jeff Squyres (jsquyres) via users Wed, 25 Sep 2019 12:30:09 -0700

Thanks, Raymond; I have filed an issue for this on GitHub and tagged the 
relevant Mellanox people:


    https://github.com/open-mpi/ompi/issues/7009


On Sep 25, 2019, at 3:09 PM, Raymond Muno via users 
<users@lists.open-mpi.org> wrote:


We are running against 4.0.2RC2 now. This is using the current Intel compilers, 
version 2019 Update 4. We are still having issues.

[epyc-compute-1-3.local:17402] common_ucx.c:149  Warning: UCX is unable to 
handle VM_UNMAP event. This may cause performance degradation or data 
corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149  Warning: UCX is unable to 
handle VM_UNMAP event. This may cause performance degradation or data 
corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149  Warning: UCX is unable to 
handle VM_UNMAP event. This may cause performance degradation or data 
corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385  Error: ucp_ep_create(proc=265) 
failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452  Error: Failed to resolve UCX 
endpoint for rank 265
[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process [47001162088449,46999827120425]
[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[epyc-compute-1-3:16626] ***    and potentially your MPI job)


On 9/25/19 1:28 PM, Jeff Squyres (jsquyres) via users wrote:
Can you try the latest 4.0.2rc tarball?  We're very, very close to releasing 
v4.0.2...

I don't know if there's a specific UCX fix in there, but there are a ton of 
other good bug fixes in there since v4.0.1.


On Sep 25, 2019, at 2:12 PM, Raymond Muno via users 
<users@lists.open-mpi.org> wrote:


We are primarily using OpenMPI 3.1.4 but also have 4.0.1 installed.

On our cluster, we were running CentOS 7.5 with updates, alongside MLNX_OFED 
4.5.x. OpenMPI was compiled with the GCC, Intel, PGI and AOCC compilers, and we 
could run with no issues.

To accommodate updates needed to get our IB gear all running at HDR100 (EDR50 
previously), we upgraded to CentOS 7.6.1810 and the current MLNX_OFED 4.6.x.

We can no longer reliably run on more than two nodes.

We see errors like:

[epyc-compute-3-2.local:42447] pml_ucx.c:380  Error: ucp_ep_create(proc=276) 
failed: Destination is unreachable
[epyc-compute-3-2.local:42447] pml_ucx.c:447  Error: Failed to resolve UCX 
endpoint for rank 276
[epyc-compute-3-2:42447] *** An error occurred in MPI_Allreduce
[epyc-compute-3-2:42447] *** reported by process [47894553493505,47893180318004]
[epyc-compute-3-2:42447] *** on communicator MPI_COMM_WORLD
[epyc-compute-3-2:42447] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-3-2:42447] *** MPI_ERRORS_ARE_FATAL (processes in this 
communicator will now abort,
[epyc-compute-3-2:42447] ***    and potentially your MPI job)
[epyc-compute-3-17.local:36637] PMIX ERROR: UNREACHABLE in file 
server/pmix_server.c at line 2079
[epyc-compute-3-17.local:37008] pml_ucx.c:380  Error: ucp_ep_create(proc=147) 
failed: Destination is unreachable
[epyc-compute-3-17.local:37008] pml_ucx.c:447  Error: Failed to resolve UCX 
endpoint for rank 147
[epyc-compute-3-7.local:39776] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal
[epyc-compute-3-7.local:39776] Set MCA parameter "orte_base_help_aggregate" to 
0 to see all help / error messages
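
When triaging output like the above, it can help to see which hosts report 
the failure and which peer ranks they could not reach. The following is only 
a minimal sketch, not part of Open MPI or UCX: the function name 
unreachable_ranks and the regular expression are assumptions based solely on 
the log format shown in this message.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the log excerpt in this thread:
#   [host:pid] pml_ucx.c:NNN  Error: ucp_ep_create(proc=RANK) failed: ...
EP_CREATE_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+pml_ucx\.c:\d+\s+"
    r"Error:\s+ucp_ep_create\(proc=(?P<rank>\d+)\)"
)


def unreachable_ranks(log_text):
    """Map each reporting host to the set of peer ranks it failed to reach.

    Hypothetical helper for illustration; it only pattern-matches text and
    does not query MPI or UCX in any way.
    """
    failures = defaultdict(set)
    for match in EP_CREATE_RE.finditer(log_text):
        failures[match.group("host")].add(int(match.group("rank")))
    return dict(failures)


# Two lines copied from the errors above (unwrapped onto single lines).
sample = (
    "[epyc-compute-3-2.local:42447] pml_ucx.c:380  Error: "
    "ucp_ep_create(proc=276) failed: Destination is unreachable\n"
    "[epyc-compute-3-17.local:37008] pml_ucx.c:380  Error: "
    "ucp_ep_create(proc=147) failed: Destination is unreachable\n"
)

if __name__ == "__main__":
    for host, ranks in sorted(unreachable_ranks(sample).items()):
        print(host, sorted(ranks))
```

If the unreachable ranks all map to a small set of nodes, that would point 
toward a per-node fabric or driver problem rather than a general build issue; 
if they are spread evenly, the opposite is more likely. Either reading is a 
heuristic, not a diagnosis.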

UCX appears to be part of the MLNX_OFED release, and is version 1.6.0.

OpenMPI is built on the same OS and MLNX_OFED release that we are running on 
the compute nodes.

I have a case open with Mellanox but it is not clear where this error is coming 
from.

--
Jeff Squyres
jsquy...@cisco.com


--

 Ray Muno
 IT Manager
 University of Minnesota
 Aerospace Engineering and Mechanics         Mechanical Engineering




--
Jeff Squyres
jsquy...@cisco.com
