Do you have a container setting configured?
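
A quick way to check, assuming you can query the controller from a login
node, is something like:

scontrol show config | grep -i container

which should show JobContainerType (and any related settings) if a
container plugin is configured.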

On Tue, May 14, 2024 at 3:57 PM Feng Zhang <prod.f...@gmail.com> wrote:
>
> Not sure, very strange, though the two linux-vdso.so.1 addresses look different:
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>      linux-vdso.so.1 (0x00007ffde81ee000)
>
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
>      linux-vdso.so.1 (0x00007fffa66ff000)
>
> Best,
>
> Feng
>
> On Tue, May 14, 2024 at 3:43 PM Dj Merrill via slurm-users
> <slurm-users@lists.schedmd.com> wrote:
> >
> > Hi Feng,
> > Thank you for replying.
> >
> > It is the same binary on the same machine that fails.
> >
> > If I ssh to a compute node on the second cluster, it works fine.
> >
> > It fails when running in an interactive shell obtained with srun on that
> > same compute node.
> >
> > I agree that it seems like a runtime environment difference between the
> > SSH shell and the srun obtained shell.
> >
> > This is the ldd from within the srun obtained shell (and gives the error
> > when run):
> >
> > [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
> >      linux-vdso.so.1 (0x00007ffde81ee000)
> >      libresolv.so.2 => /lib64/libresolv.so.2 (0x0000154f732cc000)
> >      libpthread.so.0 => /lib64/libpthread.so.0 (0x0000154f732c7000)
> >      libstdc++.so.6 => /lib64/libstdc++.so.6 (0x0000154f73000000)
> >      librt.so.1 => /lib64/librt.so.1 (0x0000154f732c2000)
> >      libdl.so.2 => /lib64/libdl.so.2 (0x0000154f732bb000)
> >      libm.so.6 => /lib64/libm.so.6 (0x0000154f72f25000)
> >      libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000154f732a0000)
> >      libc.so.6 => /lib64/libc.so.6 (0x0000154f72c00000)
> >      /lib64/ld-linux-x86-64.so.2 (0x0000154f732f8000)
> >
> > This is the ldd from the same exact node within an SSH shell which runs
> > fine:
> >
> > [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
> >      linux-vdso.so.1 (0x00007fffa66ff000)
> >      libresolv.so.2 => /lib64/libresolv.so.2 (0x000014a9d82da000)
> >      libpthread.so.0 => /lib64/libpthread.so.0 (0x000014a9d82d5000)
> >      libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000014a9d8000000)
> >      librt.so.1 => /lib64/librt.so.1 (0x000014a9d82d0000)
> >      libdl.so.2 => /lib64/libdl.so.2 (0x000014a9d82c9000)
> >      libm.so.6 => /lib64/libm.so.6 (0x000014a9d7f25000)
> >      libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000014a9d82ae000)
> >      libc.so.6 => /lib64/libc.so.6 (0x000014a9d7c00000)
> >      /lib64/ld-linux-x86-64.so.2 (0x000014a9d8306000)
> >
> >
> > -Dj
> >
> >
> >
> > On 5/14/24 15:25, Feng Zhang via slurm-users wrote:
> > > Looks more like a runtime environment issue.
> > >
> > > Check the binaries:
> > >
> > > ldd  /mnt/local/ollama/ollama
> > >
> > > on both clusters; comparing the output may give some hints.
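> > >
> > > For example, a rough comparison (assuming you copy both files to the
> > > same machine before diffing; the awk filter drops the per-run load
> > > addresses so only the resolved library paths are compared):
> > >
> > > ldd /mnt/local/ollama/ollama | awk '{print $1, $3}' > /tmp/ldd-cluster1.txt
> > > ldd /mnt/local/ollama/ollama | awk '{print $1, $3}' > /tmp/ldd-cluster2.txt
> > > diff /tmp/ldd-cluster1.txt /tmp/ldd-cluster2.txt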
> > >
> > > Best,
> > >
> > > Feng
> > >
> > > On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users
> > > <slurm-users@lists.schedmd.com> wrote:
> > >> I'm running into a strange issue and I'm hoping another set of brains
> > >> looking at this might help.  I would appreciate any feedback.
> > >>
> > >> I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8
> > >> on Rocky Linux 8.9 machines.  The second cluster is running Slurm
> > >> 23.11.6 on Rocky Linux 9.4 machines.
> > >>
> > >> This works perfectly fine on the first cluster:
> > >>
> > >> $ srun --mem=32G --pty /bin/bash
> > >>
> > >> srun: job 93911 queued and waiting for resources
> > >> srun: job 93911 has been allocated resources
> > >>
> > >> and on the resulting shell on the compute node:
> > >>
> > >> $ /mnt/local/ollama/ollama help
> > >>
> > >> and the ollama help message appears as expected.
> > >>
> > >> However, on the second cluster:
> > >>
> > >> $ srun --mem=32G --pty /bin/bash
> > >> srun: job 3 queued and waiting for resources
> > >> srun: job 3 has been allocated resources
> > >>
> > >> and on the resulting shell on the compute node:
> > >>
> > >> $ /mnt/local/ollama/ollama help
> > >> fatal error: failed to reserve page summary memory
> > >> runtime stack:
> > >> runtime.throw({0x1240c66?, 0x154fa39a1008?})
> > >>       runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
> > >> runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
> > >>       runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
> > >> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
> > >>       runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
> > >> runtime.(*mheap).init(0x127b47e0)
> > >>       runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
> > >> runtime.mallocinit()
> > >>       runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
> > >> runtime.schedinit()
> > >>       runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
> > >> runtime.rt0_go()
> > >>       runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
> > >>
> > >>
> > >> If I ssh directly to the same node on that second cluster (skipping
> > >> Slurm entirely), and run the same "/mnt/local/ollama/ollama help"
> > >> command, it works perfectly fine.
> > >>
> > >>
> > >> My first thought was that it might be related to cgroups.  I switched
> > >> the second cluster from cgroups v2 to v1 and tried again, no
> > >> difference.  I tried disabling cgroups on the second cluster by removing
> > >> all cgroups references in the slurm.conf file but that also made no
> > >> difference.
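> > >>
> > >> One environment difference that might also be worth comparing (just a
> > >> guess on my part) is the per-process resource limits in the two
> > >> shells, since, if I understand the Go runtime correctly, it reserves a
> > >> large block of virtual address space at startup and can fail this way
> > >> when that is capped.  A quick check from each shell:
> > >>
> > >> $ ulimit -v
> > >> $ cat /proc/self/limits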
> > >>
> > >>
> > >> My guess is that something changed with regard to srun between these two
> > >> Slurm versions, but I'm not sure what.
> > >>
> > >> Any thoughts on what might be happening and/or a way to get this to work
> > >> on the second cluster?  Essentially I need a way to request an
> > >> interactive shell through Slurm that is associated with the requested
> > >> resources.  Should we be using something other than srun for this?
> > >>
> > >>
> > >> Thank you,
> > >>
> > >> -Dj
> > >>
> > >>
> > >>

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
