Looks more like a runtime environment issue. Check the binaries:
running "ldd /mnt/local/ollama/ollama" on both clusters and comparing the
output may give some hints. (A rough sketch of that comparison follows
below the quoted message.)

Best,
Feng

On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users
<slurm-users@lists.schedmd.com> wrote:
>
> I'm running into a strange issue and I'm hoping another set of brains
> looking at this might help. I would appreciate any feedback.
>
> I have two Slurm clusters. The first cluster is running Slurm 21.08.8
> on Rocky Linux 8.9 machines. The second cluster is running Slurm
> 23.11.6 on Rocky Linux 9.4 machines.
>
> This works perfectly fine on the first cluster:
>
> $ srun --mem=32G --pty /bin/bash
> srun: job 93911 queued and waiting for resources
> srun: job 93911 has been allocated resources
>
> and in the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
>
> and the ollama help message appears as expected.
>
> However, on the second cluster:
>
> $ srun --mem=32G --pty /bin/bash
> srun: job 3 queued and waiting for resources
> srun: job 3 has been allocated resources
>
> and in the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
> fatal error: failed to reserve page summary memory
>
> runtime stack:
> runtime.throw({0x1240c66?, 0x154fa39a1008?})
>         runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
> runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
>         runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
>         runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
> runtime.(*mheap).init(0x127b47e0)
>         runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
> runtime.mallocinit()
>         runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
> runtime.schedinit()
>         runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
> runtime.rt0_go()
>         runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
>
> If I ssh directly to the same node on that second cluster (skipping
> Slurm entirely) and run the same "/mnt/local/ollama/ollama help"
> command, it works perfectly fine.
>
> My first thought was that it might be related to cgroups. I switched
> the second cluster from cgroups v2 to v1 and tried again; no
> difference. I tried disabling cgroups on the second cluster by removing
> all cgroup references in the slurm.conf file, but that also made no
> difference.
>
> My guess is that something changed with regard to srun between these
> two Slurm versions, but I'm not sure what.
>
> Any thoughts on what might be happening and/or a way to get this to
> work on the second cluster? Essentially I need a way to request an
> interactive shell through Slurm that is associated with the requested
> resources. Should we be using something other than srun for this?
>
> Thank you,
>
> -Dj
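
A minimal sketch of the ldd comparison Feng suggests above, assuming a
compute node in each cluster is reachable over ssh as "cluster1-node" and
"cluster2-node" (placeholder names); the binary path is the one from the
thread:

# Capture how the dynamic loader resolves the binary's libraries on one
# node of each cluster. The sed strips the load addresses, which change
# on every run and would otherwise drown the diff in noise.
$ ssh cluster1-node "ldd /mnt/local/ollama/ollama" | sed 's/ (0x[0-9a-f]*)//' | sort > ldd-cluster1.txt
$ ssh cluster2-node "ldd /mnt/local/ollama/ollama" | sed 's/ (0x[0-9a-f]*)//' | sort > ldd-cluster2.txt

# Any library that resolves to a different path or version, or shows up
# as "not found", is a candidate runtime environment difference.
$ diff ldd-cluster1.txt ldd-cluster2.txt

If no single machine can reach nodes in both clusters, the ldd half can be
run on each cluster separately and the two files copied somewhere common
before running the diff.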
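
Since the same binary works when reaching the node over ssh but fails
under srun on the second cluster, another rough way to narrow things down
is to capture the shell environment and process limits in both kinds of
session on that same node and diff them. This is only a diagnostic
sketch: the file names are arbitrary, and some differences (the SLURM_*
variables, for instance) are expected in the srun shell.

# Inside the interactive job shell on the second cluster:
$ srun --mem=32G --pty /bin/bash
$ { env | sort; echo ---; ulimit -a; } > /tmp/env-via-srun.txt

# Inside a direct ssh session to the same compute node:
$ { env | sort; echo ---; ulimit -a; } > /tmp/env-via-ssh.txt

# From either session on that node, compare the two captures:
$ diff /tmp/env-via-srun.txt /tmp/env-via-ssh.txt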