Hi All,
Just my two cents: I think error code 63 means it is running out
of streams to use. I think you have only 64 cores, so at 100 ranks you
are oversubscribing most of them. It feels like you are running out of
resources while swapping ranks in and out on the physical cores.
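
A quick way to test that guess is to compare the requested rank count with the CPUs the launching shell can actually see. This is only a minimal sketch: the helper name `oversubscription` is made up for illustration, and it counts logical CPUs (hardware threads), which is twice the physical core count when SMT is enabled.

```python
import os


def oversubscription(requested_ranks, available_cpus=None):
    """Return how many ranks exceed the CPUs available to this process."""
    if available_cpus is None:
        # sched_getaffinity exists only on some platforms (e.g. Linux);
        # fall back to the total logical CPU count elsewhere.
        if hasattr(os, "sched_getaffinity"):
            available_cpus = len(os.sched_getaffinity(0))
        else:
            available_cpus = os.cpu_count() or 1
    return max(0, requested_ranks - available_cpus)


if __name__ == "__main__":
    # Rank counts mentioned in this thread.
    for ranks in (64, 100, 128):
        extra = oversubscription(ranks)
        print(f"{ranks} ranks: {extra} more than the CPUs visible here")
```

If this prints 0 for 128 ranks on the failing node, plain oversubscription is probably not the explanation.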
Ray
On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:
Hello Howard,
To rule out potential interactions, I tested without UCX and hcoll
support, and the issue persists.
Run command: mpirun -np 128 bin/xhpcg
Output:
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:
Error code: 63
Error name: (null)
Node: Gen2Node4
when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start
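
One way to read the code above is to look it up as a system errno. Whether mpirun's "Error code: 63" really is a plain errno value is an assumption, since the output itself gives no name. The helper `describe_errno` below is hypothetical and only wraps the standard library lookup.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 63 is the value mpirun printed; the mapping is platform specific.
    name, message = describe_errno(63)
    print(f"errno 63 -> {name}: {message}")
```

On Linux, 63 normally maps to ENOSR ("Out of streams resources"); other platforms number their error codes differently, so the lookup should be run on the node that reports the failure.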
It returns this error for any run I launch with more than 100 processes
per node. I get the same error message for multiple different codes,
so the error is MPI-related rather than program-specific.
Collin
*From:* Howard Pritchard <hpprit...@gmail.com>
*Sent:* Monday, January 27, 2020 11:20 AM
*To:* Open MPI Users <users@lists.open-mpi.org>
*Cc:* Collin Strassburger <cstrassbur...@bihrle.com>
*Subject:* Re: [OMPI users] OMPI returns error 63 on AMD 7742 when
utilizing 100+ processors per node
Hello Collin,
Could you provide more information about the error? Is there any
output from either Open MPI or, maybe, UCX, that could provide more
information about the problem you are hitting?
Howard
On Mon, Jan 27, 2020 at 08:38, Collin Strassburger via
users <users@lists.open-mpi.org> wrote:
Hello,
I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5.
Both of these versions cause the same error (error code 63) when
utilizing more than 100 cores on a single node. The processors I
am utilizing are AMD Epyc “Rome” 7742s. The OS is CentOS 8.1. I
have tried compiling with both the default gcc 8 and locally
compiled gcc 9. I have already tried modifying the maximum name
field values with no success.
My compile options are:
./configure \
    --prefix=${HPCX_HOME}/ompi \
    --with-platform=contrib/platform/mellanox/optimized
Any assistance would be appreciated,
Collin
Collin Strassburger
Bihrle Applied Research Inc.