As a test, I rebooted a set of nodes. The user could then run on 480 cores
across 5 nodes; before that, we could not run beyond two nodes.
We still get the VM_UNMAP warning, however.
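
For anyone following the thread: newer Open MPI builds print a hint with this
warning suggesting the MCA parameter below, which tells Open MPI to install
its own memory hooks rather than rely on UCX catching VM_UNMAP events. I have
not verified that 4.0.2rc2 honors the parameter, and ./your_app is just a
placeholder, so treat this as something to try rather than a confirmed fix:

    mpirun --mca pml ucx --mca opal_common_ucx_opal_mem_hooks 1 ./your_app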
On 9/25/19 2:09 PM, Raymond Muno via users wrote:
We are running against 4.0.2rc2 now, built with the current Intel
compilers, version 2019 Update 4. Still having issues.
[epyc-compute-1-3.local:17402] common_ucx.c:149 Warning: UCX is
unable to handle VM_UNMAP event. This may cause performance
degradation or data corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149 Warning: UCX is
unable to handle VM_UNMAP event. This may cause performance
degradation or data corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149 Warning: UCX is
unable to handle VM_UNMAP event. This may cause performance
degradation or data corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385 Error:
ucp_ep_create(proc=265) failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452 Error: Failed to resolve
UCX endpoint for rank 265
[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process
[47001162088449,46999827120425]
[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[epyc-compute-1-3:16626] *** and potentially your MPI job)
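
A note on the "Destination is unreachable" error above, in case it helps
others: it generally means ucp_ep_create() on one rank selected a transport
or network device that the remote rank cannot reach. A rough way to narrow
it down is to compare what UCX detects on each node and then pin the device
and transports for a test run; the device name mlx5_0:1 and the transport
list below are illustrative, not our actual hardware:

    # list the devices and transports UCX detects on this node
    ucx_info -d

    # force a specific device and transport set for a test run
    mpirun --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,sm,self ./your_app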
--
Ray Muno
IT Manager
University of Minnesota
Aerospace Engineering and Mechanics
Mechanical Engineering