Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Matthias Leopold via users
Thanks a lot for the feedback to you and Gilles. I'm completely new to this, at least I know now what _should_ work. I'll look into the lmod part, maybe I screwed something up there, I'm a newbie there too... Matthias
On 25.01.22 at 18:17, Ralph Castain via users wrote:
> Never seen anything like tha

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Ralph Castain via users
Never seen anything like that before - am I reading those errors correctly that it cannot find the "write" function symbol in libc?? Frankly, if that's true then it sounds like something is borked in the system.
> On Jan 25, 2022, at 8:26 AM, Matthias Leopold via users wrote:
>
> just in c
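For reference, whether libc really exports the symbol can be checked directly with `nm`. A minimal sketch, assuming a glibc-based x86_64 Debian/Ubuntu system (the library path varies by distribution):

```shell
# Check whether libc exports the "write" symbol.
# The library path is an assumption for glibc on x86_64 Debian/Ubuntu.
nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep -w write
```

If this prints a line with a `T` or `W` symbol type, the symbol is present and the problem is more likely in how the loader resolves it at runtime.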

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Matthias Leopold via users
just in case anyone wants to do more debugging: I ran "srun --mpi=pmix" now with "LD_DEBUG=all", the lines preceding the error are:
  1263345: symbol=write; lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
  1263345: binding file /msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/lib/
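A tidier way to capture such loader traces is to write them to per-PID files instead of mixing them into stderr. A sketch, where the benchmark binary name and output path are assumptions:

```shell
# Write glibc dynamic-linker symbol/binding traces to one file per PID.
# LD_DEBUG categories and LD_DEBUG_OUTPUT are glibc loader features.
LD_DEBUG=symbols,bindings LD_DEBUG_OUTPUT=/tmp/lddebug \
  srun --mpi=pmix -N 1 -n 1 ./all_reduce_perf   # app name is an assumption
# then search the traces for the failing lookup:
grep "symbol=write" /tmp/lddebug.*
```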

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Matthias Leopold via users
PMIx library version used by SLURM is 3.2.3
On 25.01.22 at 11:04, Gilles Gouaillardet wrote:
> PMIx library version used by SLURM
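Two common checks for comparing the PMIx versions on each side of the handshake, as a sketch (exact plugin names vary by Slurm build):

```shell
# List the MPI plugins this Slurm installation offers;
# pmix / pmix_v3 should appear if Slurm was built with PMIx support.
srun --mpi=list
srun --version
# Open MPI's side: show which PMIx it was built with or links against.
ompi_info | grep -i pmix
```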

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Gilles Gouaillardet via users
Matthias,

Thanks for the clarifications. Unfortunately, I cannot connect the dots and I must be missing something. If I recap correctly:
- SLURM has builtin PMIx support
- Open MPI has builtin PMIx support
- srun explicitly requires PMIx (srun --mpi=pmix_v3 ...)
- and yet Open MPI issues an

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Matthias Leopold via users
Hi Gilles, I'm indeed using srun, I didn't have luck using mpirun yet. Are options 2 + 3 of your list really different things? As far as I understand now, I need "Open MPI with PMI support", THEN I can use srun with PMIx. Right now using "srun --mpi=pmix(_v3)" gives the error mentioned below.

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Gilles Gouaillardet via users
Matthias,

do you run the MPI application with mpirun or srun? The error log suggests you are using srun, and SLURM only provides PMI support. If this is the case, then you have three options:
- use mpirun
- rebuild Open MPI with PMI support as Ralph previously explained
- use SLURM PMIx:
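As a sketch, the three options could look like this (binary names, rank counts, and paths are assumptions, not taken from the thread):

```shell
# Option 1: let mpirun do the wire-up itself (no Slurm PMI needed)
mpirun -np 16 ./all_reduce_perf

# Option 2: rebuild Open MPI against Slurm's PMI headers/libraries
./configure --prefix=/opt/openmpi --with-slurm --with-pmi=/usr

# Option 3: use Slurm's PMIx plugin directly from srun
srun --mpi=pmix_v3 -N 2 --ntasks-per-node=8 ./all_reduce_perf
```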

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Ralph Castain via users
You should probably ask them - I see in the top one that they used a platform file, which likely had the missing option in it. The bottom one does not use that platform file, so it was probably missed.
> On Jan 24, 2022, at 7:17 AM, Matthias Leopold via users wrote:
>
> To be sure: both pa

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Matthias Leopold via users
To be sure: both packages were provided by NVIDIA (I didn't compile them)
On 24.01.22 at 16:13, Matthias Leopold wrote:
> Thx, but I don't see this option in either of the two versions: /usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info (works with slurm):
>   Configure command line: '--build=x86_64-linux-g

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Matthias Leopold via users
Thx, but I don't see this option in either of the two versions:

/usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info (works with slurm):
  Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sy
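A quick way to compare the two builds is to pull just the configure line out of each ompi_info. A sketch; the second prefix is inferred from the library path seen in the loader trace elsewhere in the thread and may differ:

```shell
# Print each build's configure line and look for Slurm PMI flags.
/usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info | grep "Configure command line"
# second prefix is an assumption based on the library path seen above:
/msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/bin/ompi_info | grep "Configure command line"
# a Slurm-PMI-enabled build should include something like '--with-pmi=/usr'
```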

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Ralph Castain via users
If you look at your configure line, you forgot to include --with-pmi=. We don't build the Slurm PMI support by default due to the GPL licensing issues - you have to point at it.
> On Jan 24, 2022, at 6:41 AM, Matthias Leopold via users wrote:
>
> Hi,
>
> we have 2 DGX A100 machines and I'
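A hedged sketch of such a rebuild; the install prefix and the PMI location are assumptions, and --with-pmi should point at the prefix that contains include/slurm/pmi2.h and the libpmi* libraries on your system:

```shell
# Configure Open MPI with Slurm PMI support, then build and install.
./configure --prefix=/opt/openmpi-4.1.2 \
            --with-slurm \
            --with-pmi=/usr
make -j"$(nproc)"
make install
```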

[OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Matthias Leopold via users
Hi, we have 2 DGX A100 machines and I'm trying to run nccl-tests (https://github.com/NVIDIA/nccl-tests) in various ways to understand how things work. I can successfully run nccl-tests on both nodes with Slurm (via srun) when built directly on a compute node against Open MPI 4.1.2 coming fro
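For context, a typical way to build nccl-tests against a specific Open MPI and run it under Slurm looks roughly like this; MPI_HOME, CUDA_HOME, and the srun layout are assumptions for these machines:

```shell
# Build nccl-tests with MPI enabled and run the all_reduce benchmark via srun.
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests
make MPI=1 MPI_HOME=/usr/mpi/gcc/openmpi-4.1.2a1 CUDA_HOME=/usr/local/cuda
srun --mpi=pmix -N 2 --ntasks-per-node=8 ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
```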