As a test, I rebooted a set of nodes. The user could then run on 480 cores
across 5 nodes; before that, we could not run beyond two nodes.
We still get the VM_UNMAP warning, however.
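
For anyone following the thread: newer Open MPI builds print a hint with this
warning suggesting the MCA parameter below, which tells Open MPI to install
its own memory hooks rather than rely on UCX catching VM_UNMAP events. I have
not verified that 4.0.2rc2 honors the parameter, and ./your_app is just a
placeholder, so treat this as something to try rather than a confirmed fix:

    mpirun --mca pml ucx --mca opal_common_ucx_opal_mem_hooks 1 ./your_app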
On 9/25/19 2:09 PM, Raymond Muno via users wrote:
We are running against 4.0.2rc2 now, built with the current Intel
compilers, version 2019 Update 4. Still having issues.
[epyc-compute-1-3.local:17402] common_ucx.c:149 Warning: UCX is
unable to handle VM_UNMAP event. This may cause performance
degradation or data corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149 Warning: UCX is
unable to handle VM_UNMAP event. This may cause performance
degradation or data corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149 Warning: UCX is
unable to handle VM_UNMAP event. This may cause performance
degradation or data corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385 Error:
ucp_ep_create(proc=265) failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452 Error: Failed to resolve
UCX endpoint for rank 265
[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process
[47001162088449,46999827120425]
[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[epyc-compute-1-3:16626] *** and potentially your MPI job)
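
A note on the "Destination is unreachable" error above, in case it helps
others: it generally means ucp_ep_create() on one rank selected a transport
or network device that the remote rank cannot reach. A rough way to narrow
it down is to compare what UCX detects on each node and then pin the device
and transports for a test run; the device name mlx5_0:1 and the transport
list below are illustrative, not our actual hardware:

    # list the devices and transports UCX detects on this node
    ucx_info -d

    # force a specific device and transport set for a test run
    mpirun --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc,sm,self ./your_app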
--
Ray Muno
IT Manager
University of Minnesota
Aerospace Engineering and Mechanics
Mechanical Engineering