Do you have containers configured?

On Tue, May 14, 2024 at 3:57 PM Feng Zhang <prod.f...@gmail.com> wrote:
>
> Not sure, very strange, though the two linux-vdso.so.1 addresses look
> different:
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
> linux-vdso.so.1 (0x00007ffde81ee000)
>
> [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
> linux-vdso.so.1 (0x00007fffa66ff000)
>
> Best,
>
> Feng
>
> On Tue, May 14, 2024 at 3:43 PM Dj Merrill via slurm-users
> <slurm-users@lists.schedmd.com> wrote:
> >
> > Hi Feng,
> > Thank you for replying.
> >
> > It is the same binary on the same machine that fails.
> >
> > If I ssh to a compute node on the second cluster, it works fine.
> >
> > It fails when running in an interactive shell obtained with srun on
> > that same compute node.
> >
> > I agree that it seems like a runtime environment difference between
> > the SSH shell and the srun-obtained shell.
> >
> > This is the ldd output from within the srun-obtained shell (where the
> > binary gives the error when run):
> >
> > [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
> > linux-vdso.so.1 (0x00007ffde81ee000)
> > libresolv.so.2 => /lib64/libresolv.so.2 (0x0000154f732cc000)
> > libpthread.so.0 => /lib64/libpthread.so.0 (0x0000154f732c7000)
> > libstdc++.so.6 => /lib64/libstdc++.so.6 (0x0000154f73000000)
> > librt.so.1 => /lib64/librt.so.1 (0x0000154f732c2000)
> > libdl.so.2 => /lib64/libdl.so.2 (0x0000154f732bb000)
> > libm.so.6 => /lib64/libm.so.6 (0x0000154f72f25000)
> > libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000154f732a0000)
> > libc.so.6 => /lib64/libc.so.6 (0x0000154f72c00000)
> > /lib64/ld-linux-x86-64.so.2 (0x0000154f732f8000)
> >
> > This is the ldd output from the exact same node within an SSH shell,
> > where it runs fine:
> >
> > [deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
> > linux-vdso.so.1 (0x00007fffa66ff000)
> > libresolv.so.2 => /lib64/libresolv.so.2 (0x000014a9d82da000)
> > libpthread.so.0 => /lib64/libpthread.so.0 (0x000014a9d82d5000)
> > libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000014a9d8000000)
> > librt.so.1 => /lib64/librt.so.1 (0x000014a9d82d0000)
> > libdl.so.2 => /lib64/libdl.so.2 (0x000014a9d82c9000)
> > libm.so.6 => /lib64/libm.so.6 (0x000014a9d7f25000)
> > libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000014a9d82ae000)
> > libc.so.6 => /lib64/libc.so.6 (0x000014a9d7c00000)
> > /lib64/ld-linux-x86-64.so.2 (0x000014a9d8306000)
> >
> > -Dj
> >
> > On 5/14/24 15:25, Feng Zhang via slurm-users wrote:
> > > Looks more like a runtime environment issue.
> > >
> > > Check the binaries:
> > >
> > > ldd /mnt/local/ollama/ollama
> > >
> > > on both clusters; comparing the output may give some hints.
> > >
> > > Best,
> > >
> > > Feng
> > >
> > > On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users
> > > <slurm-users@lists.schedmd.com> wrote:
> > >> I'm running into a strange issue and I'm hoping another set of
> > >> brains looking at this might help. I would appreciate any feedback.
> > >>
> > >> I have two Slurm clusters. The first cluster is running Slurm
> > >> 21.08.8 on Rocky Linux 8.9 machines. The second cluster is running
> > >> Slurm 23.11.6 on Rocky Linux 9.4 machines.
> > >>
> > >> This works perfectly fine on the first cluster:
> > >>
> > >> $ srun --mem=32G --pty /bin/bash
> > >> srun: job 93911 queued and waiting for resources
> > >> srun: job 93911 has been allocated resources
> > >>
> > >> and on the resulting shell on the compute node:
> > >>
> > >> $ /mnt/local/ollama/ollama help
> > >>
> > >> and the ollama help message appears as expected.
> > >>
> > >> However, on the second cluster:
> > >>
> > >> $ srun --mem=32G --pty /bin/bash
> > >> srun: job 3 queued and waiting for resources
> > >> srun: job 3 has been allocated resources
> > >>
> > >> and on the resulting shell on the compute node:
> > >>
> > >> $ /mnt/local/ollama/ollama help
> > >> fatal error: failed to reserve page summary memory
> > >> runtime stack:
> > >> runtime.throw({0x1240c66?, 0x154fa39a1008?})
> > >>         runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
> > >> runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
> > >>         runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
> > >> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
> > >>         runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
> > >> runtime.(*mheap).init(0x127b47e0)
> > >>         runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
> > >> runtime.mallocinit()
> > >>         runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
> > >> runtime.schedinit()
> > >>         runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
> > >> runtime.rt0_go()
> > >>         runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
> > >>
> > >> If I ssh directly to the same node on that second cluster (skipping
> > >> Slurm entirely) and run the same "/mnt/local/ollama/ollama help"
> > >> command, it works perfectly fine.
> > >>
> > >> My first thought was that it might be related to cgroups. I switched
> > >> the second cluster from cgroups v2 to v1 and tried again: no
> > >> difference. I also tried disabling cgroups on the second cluster by
> > >> removing all cgroup references from the slurm.conf file, but that
> > >> made no difference either.
> > >>
> > >> My guess is that something changed with regard to srun between
> > >> these two Slurm versions, but I'm not sure what.
> > >>
> > >> Any thoughts on what might be happening and/or a way to get this
> > >> working on the second cluster? Essentially I need a way to request
> > >> an interactive shell through Slurm that is associated with the
> > >> requested resources. Should we be using something other than srun
> > >> for this?
> > >>
> > >> Thank you,
> > >>
> > >> -Dj
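
For what it's worth, the two ldd listings above differ only in the
per-process load addresses (the vdso address changes on every run because
of address space layout randomization), so the libraries themselves look
identical. The "failed to reserve page summary memory" message comes from
the Go runtime failing to reserve virtual address space at startup, so a
rough follow-up check, run once in the SSH shell and once in the
srun-obtained shell on the same node, might look like the following
(standard Linux commands; how to interpret them depends on your local
limits configuration, which is an assumption here):

$ ulimit -v              # max virtual address space for this shell
$ ulimit -a              # all soft limits
$ cat /proc/self/limits  # limits the kernel actually applies
$ cat /proc/self/cgroup  # which cgroup the shell landed in

If the srun shell shows a finite "ulimit -v" (or a much smaller address
space limit in /proc/self/limits) while the SSH shell shows "unlimited",
that difference would be consistent with the Go runtime error above.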
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
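
Regarding the container question at the top of the thread, a minimal way
to check whether either cluster has a container plugin configured might
be (the /etc/slurm/slurm.conf path below is the common default and is an
assumption; adjust for your installation):

$ scontrol show config | grep -i container
$ grep -i -E 'job_container|container' /etc/slurm/slurm.conf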