Can you please send all the information listed here:
https://www.open-mpi.org/community/help/
Thanks!
On Jan 27, 2020, at 12:00 PM, Collin Strassburger via users
<[email protected]> wrote:
Hello,
I had initially thought the same thing about the streams, but I have 2 sockets
with 64 cores each. Additionally, I have not yet turned multithreading off, so
lscpu reports a total of 256 logical cores and 128 physical cores. As such, I
don’t see how it could be running out of streams unless something is being
passed incorrectly.
Collin
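The core counts quoted above can be checked with a little arithmetic. The sketch below is only illustrative: it assumes lscpu prints the usual "Socket(s)", "Core(s) per socket" and "Thread(s) per core" fields, and the SAMPLE text is a hypothetical excerpt built from the numbers in the message, not real output from the node. The function name parse_lscpu_topology is likewise made up for this example.

```python
def parse_lscpu_topology(lscpu_text):
    """Return (physical_cores, logical_cpus) computed from lscpu-style text."""
    fields = {}
    for line in lscpu_text.splitlines():
        # Each lscpu line looks like "Key:   value"; skip lines without a colon.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    sockets = int(fields["Socket(s)"])
    cores_per_socket = int(fields["Core(s) per socket"])
    threads_per_core = int(fields["Thread(s) per core"])
    physical = sockets * cores_per_socket
    logical = physical * threads_per_core
    return physical, logical


# Hypothetical excerpt: 2 sockets, 64 cores each, SMT still enabled.
SAMPLE = """\
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
"""

physical, logical = parse_lscpu_topology(SAMPLE)
print(physical, logical)  # prints "128 256"
```

With these numbers, a run with -np 128 would map to one rank per physical core, so oversubscription alone seems an unlikely explanation.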
From: users <[email protected]> On Behalf Of Ray Sheppard via users
Sent: Monday, January 27, 2020 11:53 AM
To: [email protected]
Cc: Ray Sheppard <[email protected]>
Subject: Re: [OMPI users] [External] Re: OMPI returns error 63 on AMD 7742 when
utilizing 100+ processors per node
Hi All,
Just my two cents: I think error code 63 is saying it is running out of
streams to use. I think you have only 64 cores, so at 100 ranks you are
overloading most of them. It feels like you are running out of resources
trying to swap ranks in and out on physical cores.
Ray
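One way to sanity-check the reading of "63" as an operating-system error is to ask the local C library what that errno value means. This is a rough sketch under a loud assumption: that the code mpirun prints is a raw errno at all, which nothing in the thread confirms. Errno numbering also differs between operating systems (63 is, as far as I know, ENOSR on Linux but ENAMETOOLONG on macOS/BSD), so the result is only meaningful when run on the affected node. The helper name describe_errno is made up for this example.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic_name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as "ENOSR";
    # fall back to "UNKNOWN" if this platform does not define the value.
    name = errno.errorcode.get(code, "UNKNOWN")
    message = os.strerror(code)
    return name, message


# Assumption: the "Error code: 63" from mpirun is an OS errno value.
name, message = describe_errno(63)
print(f"errno 63 on this system: {name} ({message})")
```

The output depends on the platform, so none is shown here.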
On 1/27/2020 11:29 AM, Collin Strassburger via users wrote:
Hello Howard,
To rule out potential interactions, I tested without UCX and hcoll support,
and the issue persists.
Run command: mpirun -np 128 bin/xhpcg
Output:
--------------------------------------------------------------------------
mpirun was unable to start the specified application as it encountered an
error:
Error code: 63
Error name: (null)
Node: Gen2Node4
when attempting to start process rank 0.
--------------------------------------------------------------------------
128 total processes failed to start
I get this error for any run that starts more than 100 processes per node.
The same error message appears with several different codes, so the error
appears to be MPI-related rather than program-specific.
Collin
From: Howard Pritchard <[email protected]>
Sent: Monday, January 27, 2020 11:20 AM
To: Open MPI Users <[email protected]>
Cc: Collin Strassburger <[email protected]>
Subject: Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+
processors per node
Hello Collin,
Could you provide more information about the error? Is there any output from
either Open MPI or, maybe, UCX, that could provide more information about the
problem you are hitting?
Howard
On Mon, Jan 27, 2020 at 08:38, Collin Strassburger via users
<[email protected]> wrote:
Hello,
I am having difficulty with Open MPI versions 4.0.2 and 3.1.5. Both of these
versions cause the same error (error code 63) when utilizing more than 100
cores on a single node. The processors I am utilizing are AMD EPYC “Rome”
7742s. The OS is CentOS 8.1. I have tried compiling with both the default gcc
8 and a locally compiled gcc 9. I have already tried modifying the maximum name
field values with no success.
My compile options are:
./configure \
    --prefix=${HPCX_HOME}/ompi \
    --with-platform=contrib/platform/mellanox/optimized
Any assistance would be appreciated,
Collin
Collin Strassburger
Bihrle Applied Research Inc.
--
Jeff Squyres
[email protected]