Thanks a lot to you and Gilles for the feedback. I'm completely new to this,
but at least I now know what _should_ work. I'll look into the lmod part,
maybe I messed something up there, I'm a newbie there too...
Matthias
On 25.01.22 at 18:17, Ralph Castain via users wrote:
Never seen anything like that before - am I reading those errors correctly that
it cannot find the "write" function symbol in libc?? Frankly, if that's true
then it sounds like something is borked in the system.
> On Jan 25, 2022, at 8:26 AM, Matthias Leopold via users
> wrote:
>
Just in case anyone wants to do more debugging: I ran "srun --mpi=pmix"
now with "LD_DEBUG=all"; the lines preceding the error are:
1263345: symbol=write; lookup in file=/lib/x86_64-linux-gnu/libpthread.so.0 [0]
1263345: binding file /msc/sw/hpc-sdk/Linux_x86_64/21.9/comm_libs/mpi/lib/
PMIx library version used by SLURM is 3.2.3
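For the record, the trace was captured roughly like this (./my_mpi_app
just stands in for the actual test binary):

    LD_DEBUG=all LD_DEBUG_OUTPUT=/tmp/ld-trace srun --mpi=pmix ./my_mpi_app
    # glibc then writes one /tmp/ld-trace.<pid> file per process
    # containing every symbol lookup and binding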
On 25.01.22 at 11:04, Gilles Gouaillardet wrote:
Matthias,
Thanks for the clarifications.
Unfortunately, I cannot connect the dots and I must be missing something.
If I recap correctly:
- SLURM has builtin PMIx support
- Open MPI has builtin PMIx support
- srun explicitly requires PMIx (srun --mpi=pmix_v3 ...)
- and yet Open MPI issues an error.
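As a sanity check, it may be worth confirming what each side was really
built with, for example:

    srun --mpi=list           # PMI plugins SLURM was built with (pmix/pmix_v3 should appear)
    ompi_info | grep -i pmix  # PMIx components Open MPI was built with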
Hi Gilles,
I'm indeed using srun; I haven't had luck with mpirun yet.
Are options 2 and 3 of your list really different things? As far as I
understand it now, I need "Open MPI with PMI support", and THEN I can use
srun with PMIx. Right now "srun --mpi=pmix(_v3)" gives the error
mentioned below.
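If it helps, I can also check what the library actually links against
(the lib path is my guess based on the ompi_info path below):

    ldd /usr/mpi/gcc/openmpi-4.1.2a1/lib/libmpi.so | grep -i pmi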
Matthias,
do you run the MPI application with mpirun or srun?
The error log suggests you are using srun, and SLURM provides only PMI
support.
If this is the case, then you have three options:
- use mpirun
- rebuild Open MPI with PMI support, as Ralph previously explained (a minimal sketch follows this list)
- use SLURM's PMIx support (srun --mpi=pmix ...)
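For the second option, a minimal configure sketch (the prefix and the PMI
path are examples; point --with-pmi at the directory containing
include/slurm/pmi2.h and the libpmi2 library):

    ./configure --prefix=/opt/openmpi-4.1.2 \
                --with-slurm \
                --with-pmi=/usr
    make all install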
You should probably ask them - I see in the top one that they used a platform
file, which likely had the missing option in it. The bottom one does not use
that platform file, so it was probably missed.
> On Jan 24, 2022, at 7:17 AM, Matthias Leopold via users
> wrote:
>
To be sure: both packages were provided by NVIDIA (I didn't compile them)
On 24.01.22 at 16:13, Matthias Leopold wrote:
Thx, but I don't see this option in either of the two versions:
/usr/mpi/gcc/openmpi-4.1.2a1/bin/ompi_info (works with slurm):
Configure command line: '--build=x86_64-linux-gnu' '--prefix=/usr'
'--includedir=${prefix}/include' '--mandir=${prefix}/share/man'
'--infodir=${prefix}/share/info' '--sy
If you look at your configure line, you forgot to include
--with-pmi=<path-to-pmi>. We don't build the Slurm PMI support by
default due to the GPL licensing issues - you have to point at it.
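Once rebuilt, something along these lines should confirm it took (the
binary name is just a placeholder):

    ompi_info | grep -i pmi       # the configure line / PMI components should now show up
    srun --mpi=pmi2 ./my_mpi_app  # a plain PMI-2 launch should then work as well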
> On Jan 24, 2022, at 6:41 AM, Matthias Leopold via users
> wrote:
>
Hi,
we have 2 DGX A100 machines and I'm trying to run nccl-tests
(https://github.com/NVIDIA/nccl-tests) in various ways to understand how
things work.
I can successfully run nccl-tests on both nodes with Slurm (via srun)
when built directly on a compute node against the Open MPI 4.1.2 coming
from the NVIDIA-provided packages.
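The successful invocation looks roughly like this (2 nodes with 8 GPUs
each in our case; the nccl-tests flags follow their README):

    srun --mpi=pmix -N 2 --ntasks-per-node=8 --gpus-per-node=8 \
         ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1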