Collin,

A couple of things to try. First, could you configure without using the Mellanox platform file and see if you can run the app with 100 or more processes?
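For reference, that would just be your existing configure line with the platform option dropped (prefix taken from your message below, followed by the usual make / make install):

    ./configure --prefix=${HPCX_HOME}/ompi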
Another thing to try is to keep using the Mellanox platform file, but run the app with

    mpirun --mca pml ob1 -np 100 bin/xhpcg

and see if the app runs successfully.

Howard

On Mon, Jan 27, 2020 at 9:29 AM Collin Strassburger <cstrassbur...@bihrle.com> wrote:

> Hello Howard,
>
> To remove potential interactions, I have found that the issue persists
> without ucx and hcoll support.
>
> Run command: mpirun -np 128 bin/xhpcg
>
> Output:
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an
> error:
>
> Error code: 63
> Error name: (null)
> Node: Gen2Node4
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> 128 total processes failed to start
>
> It returns this error for any job I launch with more than 100 processes
> per node. I get the same error message for multiple different codes, so
> the error is MPI related rather than program specific.
>
> Collin
>
> From: Howard Pritchard <hpprit...@gmail.com>
> Sent: Monday, January 27, 2020 11:20 AM
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Collin Strassburger <cstrassbur...@bihrle.com>
> Subject: Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node
>
> Hello Collin,
>
> Could you provide more information about the error? Is there any output
> from either Open MPI or, perhaps, UCX, that could tell us more about the
> problem you are hitting?
>
> Howard
>
> On Mon, Jan 27, 2020 at 8:38 AM Collin Strassburger via users
> <users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I am having difficulty with Open MPI versions 4.0.2 and 3.1.5. Both of
> these versions produce the same error (error code 63) when utilizing more
> than 100 cores on a single node. The processors I am using are AMD Epyc
> "Rome" 7742s. The OS is CentOS 8.1. I have tried compiling with both the
> default gcc 8 and a locally compiled gcc 9. I have already tried modifying
> the maximum name field values with no success.
>
> My configure options are:
>
> ./configure
> --prefix=${HPCX_HOME}/ompi
> --with-platform=contrib/platform/mellanox/optimized
>
> Any assistance would be appreciated,
> Collin
>
> Collin Strassburger
> Bihrle Applied Research Inc.
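As an extra data point, it might also be worth checking whether the launcher itself can start that many local processes with a trivial non-MPI command, e.g.

    mpirun -np 128 hostname

If that fails with the same error 63, it would point at process launch rather than at UCX, hcoll, or the application itself.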