As a test, I rebooted a set of nodes. The user could then run on 480 cores across 5 nodes; before that, we could not run beyond two nodes.

We still get the VM_UNMAP warning, however.
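For anyone else chasing this, the workaround we plan to test is the one the newer common_ucx warning text points at: forcing Open MPI's own memory hooks so UCX sees the VM_UNMAP events. Treat this as a sketch, not a confirmed fix; we have not yet verified it clears the warning on our 4.0.2RC2 build (./a.out stands in for the user's actual binary):

    # Route memory events through Open MPI's own memory hooks
    mpirun --mca opal_common_ucx_opal_mem_hooks 1 -np 480 ./a.out

    # Or disable UCX memory-event interception outright
    # (may cost registration-cache performance)
    mpirun -x UCX_MEM_EVENTS=no -np 480 ./a.out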

On 9/25/19 2:09 PM, Raymond Muno via users wrote:

We are running against 4.0.2RC2 now, built with the current Intel compilers, version 2019 Update 4. Still having issues.

[epyc-compute-1-3.local:17402] common_ucx.c:149  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:17669] common_ucx.c:149  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:17683] common_ucx.c:149  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.
[epyc-compute-1-3.local:16626] pml_ucx.c:385  Error: ucp_ep_create(proc=265) failed: Destination is unreachable
[epyc-compute-1-3.local:16626] pml_ucx.c:452  Error: Failed to resolve UCX endpoint for rank 265
[epyc-compute-1-3:16626] *** An error occurred in MPI_Allreduce
[epyc-compute-1-3:16626] *** reported by process [47001162088449,46999827120425]
[epyc-compute-1-3:16626] *** on communicator MPI_COMM_WORLD
[epyc-compute-1-3:16626] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-1-3:16626] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[epyc-compute-1-3:16626] ***    and potentially your MPI job)
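The "Destination is unreachable" error from ucp_ep_create looks like a UCX transport-selection problem between nodes rather than a fault in the application itself. A rough way to probe it (the mlx5_0:1 device name here is only an example; substitute whatever ucx_info reports on these nodes):

    # List the transports and devices UCX detects on a compute node
    ucx_info -d

    # Pin UCX to a known-good transport set and the actual fabric device
    mpirun -x UCX_TLS=rc,sm,self -x UCX_NET_DEVICES=mlx5_0:1 -np 480 ./a.out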



--
Ray Muno
 IT Manager
 University of Minnesota
 Aerospace Engineering and Mechanics / Mechanical Engineering
