Hi,
we have 2 DGX A100 machines and I'm trying to run nccl-tests
(https://github.com/NVIDIA/nccl-tests) in various ways to understand how
things work.
I can successfully run nccl-tests on both nodes with Slurm (via srun)
when built directly on a compute node against Open MPI 4.1.2 coming fro
j/cuda/11.0/Linux_x86_64'
'--with-ucx=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1'
Matthias
On 24.01.22 at 15:59, Ralph Castain via users wrote:
If you look at your configure line, you forgot to include
--with-pmi=. We don't build the Slurm PMI suppor
'--enable-mpi1-compatibility'
'--enable-mca-no-build=btl-uct' '--without-verbs'
'--with-cuda=/proj/cuda/11.0/Linux_x86_64'
'--with-ucx=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1'
Matthias
On 24.01.22 at 15:59, Ralp
the missing option in it. The bottom
one does not use that platform file, so it was probably missed.
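
The check described above (does a configure line contain a --with-pmi
option?) can be sketched in a few lines of Python. This is only an
illustration, not part of Open MPI or Slurm: the helper name
has_pmi_option is made up here, the flags in the example are copied from
the configure output quoted in this thread, and the /usr path in the
second call is a placeholder assumption.

```python
import shlex


def has_pmi_option(configure_line):
    """Return True if the configure line contains --with-pmi or --with-pmi=<dir>.

    Hypothetical helper for illustration; it only inspects the text of
    the configure line and does not query an Open MPI installation.
    """
    # shlex.split removes the single quotes that surround each flag
    # in the configure line as it was pasted into the thread.
    tokens = shlex.split(configure_line)
    return any(t == "--with-pmi" or t.startswith("--with-pmi=") for t in tokens)


# Flags taken from the configure output quoted in this thread.
quoted_flags = (
    "'--enable-mpi1-compatibility' '--enable-mca-no-build=btl-uct' "
    "'--without-verbs' '--with-cuda=/proj/cuda/11.0/Linux_x86_64'"
)

print(has_pmi_option(quoted_flags))
# The /usr location below is a placeholder, not a path from the thread.
print(has_pmi_option(quoted_flags + " '--with-pmi=/usr'"))
```

The first call reports that the option is absent, the second that it is
present. Matching on whole tokens keeps a different option such as
--with-pmix from being counted as --with-pmi.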
> On Jan 24, 2022, at 7:17 AM, Matthias Leopold via users
> <users@lists.open-mpi.org> wrote:
>
> To be sure: both packages were provided by NVIDIA (I di
PMIx library version used by SLURM is 3.2.3
On 25.01.22 at 11:04, Gilles Gouaillardet wrote:
PMIx library version used by SLURM
>
> You should probably ask them - I see in the top one that they used a
> platform file, which likely had the missing option in it. The bottom
> one does not use that platform file, so it was probably missed.
>
>
>
n anything like that before - am I reading those errors correctly that it cannot
find the "write" function symbol in libc?? Frankly, if that's true then it
sounds like something is borked in the system.
On Jan 25, 2022, at 8:26 AM, Matthias Leopold via users
wrote:
just i