[OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Matthias Leopold via users
Hi, we have 2 DGX A100 machines and I'm trying to run nccl-tests (https://github.com/NVIDIA/nccl-tests) in various ways to understand how things work. I can successfully run nccl-tests on both nodes with Slurm (via srun) when built directly on a compute node against Open MPI 4.1.2 coming fro

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Matthias Leopold via users
j/cuda/11.0/Linux_x86_64' '--with-ucx=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1' Matthias On 24.01.22 at 15:59, Ralph Castain via users wrote: If you look at your configure line, you forgot to include --with-pmi=. We don't build the Slurm PMI suppor

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Matthias Leopold via users
'--enable-mpi1-compatibility' '--enable-mca-no-build=btl-uct' '--without-verbs' '--with-cuda=/proj/cuda/11.0/Linux_x86_64' '--with-ucx=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1' Matthias On 24.01.22 at 15:59, Ralp

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Matthias Leopold via users
the missing option in it. The bottom one does not use that platform file, so it was probably missed. > On Jan 24, 2022, at 7:17 AM, Matthias Leopold via users <users@lists.open-mpi.org> wrote: > > To be sure: both packages were provided by NVIDIA (I di

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Matthias Leopold via users
PMIx library version used by SLURM is 3.2.3 On 25.01.22 at 11:04, Gilles Gouaillardet wrote: PMIx library version used by SLURM

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Matthias Leopold via users
> >     You should probably ask them - I see in the top one that they used a >     platform file, which likely had the missing option in it. The bottom >     one does not use that platform file, so it was probably missed. > > >     

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Matthias Leopold via users
n anything like that before - am I reading those errors correctly that it cannot find the "write" function symbol in libc?? Frankly, if that's true then it sounds like something is borked in the system. On Jan 25, 2022, at 8:26 AM, Matthias Leopold via users wrote: just i
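
The remark quoted earlier in this thread, that the configure line lacked --with-pmi=, can be checked mechanically. The following is only a minimal sketch: the helper name has_option is hypothetical, and the flags are the ones visible in the snippets above, not the complete configure line of either build.

```python
import shlex


def has_option(configure_line: str, name: str) -> bool:
    """Return True if `name` occurs as an option in the configure line,
    either bare ('--with-pmi') or with a value ('--with-pmi=/some/path')."""
    for token in shlex.split(configure_line):
        if token == name or token.startswith(name + "="):
            return True
    return False


# Flags copied from the snippets in this thread (an incomplete line;
# the full configure invocation is not shown in the archive excerpt).
line = (
    "'--enable-mpi1-compatibility' '--enable-mca-no-build=btl-uct' "
    "'--without-verbs' '--with-cuda=/proj/cuda/11.0/Linux_x86_64' "
    "'--with-ucx=/proj/nv/libraries/Linux_x86_64/dev/openmpi4/205295-dev-clean-1'"
)

print(has_option(line, "--with-pmi"))   # is the Slurm PMI option present?
print(has_option(line, "--with-cuda"))  # is the CUDA option present?
```

Matching on the exact name or on the name followed by "=" keeps --with-pmi distinct from similarly named options such as --with-pmix.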