We are now using OpenMPI 4.0.2RC2 and RC3, compiled with Intel, PGI, and
GCC, against MLNX_OFED 4.7 (released a couple of days ago), which
supplies UCX 1.7. So far, things seem to be working well.
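For reference, a quick way to confirm which UCX a given OpenMPI build picked up is to query both sides; the grep pattern is just a convenience and the exact wording of the output varies by version:

    # UCX support as reported by the OpenMPI build
    ompi_info | grep -i ucx
    # UCX runtime version installed by MLNX_OFED (should report 1.7 here)
    ucx_info -v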
Any estimate on when OpenMPI 4.0.2 will be released?
On 9/25/19 2:27 PM, Jeff Squyres (jsquyres) wrote:
Thanks Raymond; I have filed an issue for this on GitHub and tagged
the relevant Mellanox people:
https://github.com/open-mpi/ompi/issues/7009
On Sep 25, 2019, at 3:09 PM, Raymond Muno via users
<users@lists.open-mpi.org> wrote:
We are running against 4.0.2RC2 now, built with the current Intel
compilers, version 2019 Update 4. We are still having issues.
[epyc-compute-1-3.local:17402] common_ucx.c:149 Warning: UCX is
unable to handle VM_UNMAP event. This may cause performance
degradation or data corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149 Warning: UCX is
unable to handle VM_UNMAP event. This may cause performance
degradation or data corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149 Warning: UCX is
unable to handle VM_UNMAP event. This may cause performance
degradation or data corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385 Error:
ucp_ep_create(proc=265) failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452 Error: Failed to
resolve UCX endpoint for rank 265
[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process
[47001162088449,46999827120425]
[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[epyc-compute-1-3:16626] *** and potentially your MPI job)
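A couple of diagnostics that may be worth a try against 4.0.2RC2; the memory-hooks MCA parameter is an assumption about this build (verify that it exists with ompi_info --all first), and the binary and rank count are placeholders:

    # Check whether the build exposes the UCX memory-hooks knob, then try it;
    # it is a commonly suggested workaround for the VM_UNMAP warning
    ompi_info --all | grep -i mem_hooks
    mpirun --mca opal_common_ucx_opal_mem_hooks 1 -np 512 ./your_app
    # More detail on the "Destination is unreachable" endpoint failures
    mpirun --mca pml ucx -x UCX_LOG_LEVEL=debug -np 512 ./your_app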
On 9/25/19 1:28 PM, Jeff Squyres (jsquyres) via users wrote:
Can you try the latest 4.0.2rc tarball? We're very, very close to
releasing v4.0.2...
I don't know if there's a specific UCX fix in there, but there are a
ton of other good bug fixes since v4.0.1.
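In case it helps anyone testing the RC, a minimal build recipe against the MLNX_OFED-provided UCX looks roughly like the sketch below; the tarball name, install prefix, UCX path, and compiler wrappers are all site-specific placeholders:

    tar xf openmpi-4.0.2rcX.tar.bz2 && cd openmpi-4.0.2rcX
    # Point --with-ucx at wherever MLNX_OFED installed UCX (often /usr)
    ./configure --prefix=$HOME/sw/openmpi-4.0.2rcX \
                --with-ucx=/usr CC=icc CXX=icpc FC=ifort
    make -j 16 && make install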
On Sep 25, 2019, at 2:12 PM, Raymond Muno via users
<users@lists.open-mpi.org> wrote:
We are primarily using OpenMPI 3.1.4 but also have 4.0.1 installed.
On our cluster, we were running CentOS 7.5 with updates, alongside
MLNX_OFED 4.5.x. OpenMPI was compiled with GCC, Intel, PGI and
AOCC compilers. We could run with no issues.
To accommodate updates needed to get our IB gear all running at
HDR100 (EDR50 previously), we upgraded to CentOS 7.6.1810 and the
current MLNX_OFED 4.6.x.
We can no longer reliably run on more than two nodes.
We see errors like:
[epyc-compute-3-2.local:42447] pml_ucx.c:380 Error:
ucp_ep_create(proc=276) failed: Destination is unreachable
[epyc-compute-3-2.local:42447] pml_ucx.c:447 Error: Failed to
resolve UCX endpoint for rank 276
[epyc-compute-3-2:42447] *** An error occurred in MPI_Allreduce
[epyc-compute-3-2:42447] *** reported by process
[47894553493505,47893180318004]
[epyc-compute-3-2:42447] *** on communicator MPI_COMM_WORLD
[epyc-compute-3-2:42447] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-3-2:42447] *** MPI_ERRORS_ARE_FATAL (processes in
this communicator will now abort,
[epyc-compute-3-2:42447] *** and potentially your MPI job)
[epyc-compute-3-17.local:36637] PMIX ERROR: UNREACHABLE in file
server/pmix_server.c at line 2079
[epyc-compute-3-17.local:37008] pml_ucx.c:380 Error:
ucp_ep_create(proc=147) failed: Destination is unreachable
[epyc-compute-3-17.local:37008] pml_ucx.c:447 Error: Failed to
resolve UCX endpoint for rank 147
[epyc-compute-3-7.local:39776] 1 more process has sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[epyc-compute-3-7.local:39776] Set MCA parameter
"orte_base_help_aggregate" to 0 to see all help / error messages
UCX appears to be part of the MLNX_OFED release, and is version 1.6.0.
OpenMPI is built against the same OS and MLNX_OFED that we are
running on the compute nodes.
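For what it's worth, a few sanity checks that the build and the fabric agree after the OS/OFED upgrade; $OMPI_PREFIX is a placeholder for the OpenMPI install prefix, and the ib* tools come from the OFED install:

    # Which UCX library the OpenMPI UCX PML actually resolves at run time
    ldd $OMPI_PREFIX/lib/openmpi/mca_pml_ucx.so | grep -i ucp
    # Transports/devices UCX 1.6.0 sees on a compute node
    ucx_info -d | grep -E 'Transport|Device'
    # Port state and rate after the HDR100 changes
    ibv_devinfo | grep -E 'hca_id|state|active_width|active_speed'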
I have a case open with Mellanox, but it is not clear where this
error is coming from.
--
Jeff Squyres
jsquy...@cisco.com
--
Ray Muno
IT Manager
University of Minnesota
Aerospace Engineering and Mechanics
Mechanical Engineering