Hi Feng,
Thank you for replying.
It is the same binary on the same machine that fails.
If I ssh to a compute node on the second cluster, it works fine.
It fails when running in an interactive shell obtained with srun on that
same compute node.
I agree that it seems like a runtime environment difference between the
SSH shell and the srun-obtained shell.
This is the ldd output from within the srun-obtained shell (where the
binary gives the error when run):
[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
linux-vdso.so.1 (0x00007ffde81ee000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x0000154f732cc000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000154f732c7000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x0000154f73000000)
librt.so.1 => /lib64/librt.so.1 (0x0000154f732c2000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000154f732bb000)
libm.so.6 => /lib64/libm.so.6 (0x0000154f72f25000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000154f732a0000)
libc.so.6 => /lib64/libc.so.6 (0x0000154f72c00000)
/lib64/ld-linux-x86-64.so.2 (0x0000154f732f8000)
This is the ldd output from an SSH shell on the exact same node, where
the binary runs fine:
[deej@moose66 ~]$ ldd /mnt/local/ollama/ollama
linux-vdso.so.1 (0x00007fffa66ff000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x000014a9d82da000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000014a9d82d5000)
libstdc++.so.6 => /lib64/libstdc++.so.6 (0x000014a9d8000000)
librt.so.1 => /lib64/librt.so.1 (0x000014a9d82d0000)
libdl.so.2 => /lib64/libdl.so.2 (0x000014a9d82c9000)
libm.so.6 => /lib64/libm.so.6 (0x000014a9d7f25000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000014a9d82ae000)
libc.so.6 => /lib64/libc.so.6 (0x000014a9d7c00000)
/lib64/ld-linux-x86-64.so.2 (0x000014a9d8306000)
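Since the linked libraries are identical in both cases, the next place I
plan to look is the rest of the runtime environment, e.g. comparing
resource limits and environment variables between the two shells with
something like the following (the /tmp file names are just scratch files):
$ ulimit -a                       # especially the "virtual memory" limit (ulimit -v)
$ env | sort > /tmp/env.srun      # run in the srun-obtained shell
$ env | sort > /tmp/env.ssh       # run in the SSH shell
$ diff /tmp/env.srun /tmp/env.ssh
The "failed to reserve page summary memory" error below looks like the Go
runtime failing to reserve virtual address space at startup, so a lower
virtual memory limit in the srun shell would be my first suspect.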
-Dj
On 5/14/24 15:25, Feng Zhang via slurm-users wrote:
Looks more like a runtime environment issue.
Check the binaries:
ldd /mnt/local/ollama/ollama
on both clusters; comparing the output may give some hints.
Best,
Feng
On Tue, May 14, 2024 at 2:41 PM Dj Merrill via slurm-users
<slurm-users@lists.schedmd.com> wrote:
I'm running into a strange issue and I'm hoping another set of brains
looking at this might help. I would appreciate any feedback.
I have two Slurm Clusters. The first cluster is running Slurm 21.08.8
on Rocky Linux 8.9 machines. The second cluster is running Slurm
23.11.6 on Rocky Linux 9.4 machines.
This works perfectly fine on the first cluster:
$ srun --mem=32G --pty /bin/bash
srun: job 93911 queued and waiting for resources
srun: job 93911 has been allocated resources
and on the resulting shell on the compute node:
$ /mnt/local/ollama/ollama help
and the ollama help message appears as expected.
However, on the second cluster:
$ srun --mem=32G --pty /bin/bash
srun: job 3 queued and waiting for resources
srun: job 3 has been allocated resources
and on the resulting shell on the compute node:
$ /mnt/local/ollama/ollama help
fatal error: failed to reserve page summary memory
runtime stack:
runtime.throw({0x1240c66?, 0x154fa39a1008?})
    runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
    runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
    runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
runtime.(*mheap).init(0x127b47e0)
    runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
runtime.mallocinit()
    runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
runtime.schedinit()
    runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
runtime.rt0_go()
    runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
If I ssh directly to the same node on that second cluster (skipping
Slurm entirely), and run the same "/mnt/local/ollama/ollama help"
command, it works perfectly fine.
My first thought was that it might be related to cgroups. I switched
the second cluster from cgroups v2 to v1 and tried again, but there was
no difference. I also tried disabling cgroups on the second cluster by
removing all cgroup references from slurm.conf, but that made no
difference either.
My guess is that something changed with regard to srun between these two
Slurm versions, but I'm not sure what.
Any thoughts on what might be happening and/or a way to get this to work
on the second cluster? Essentially I need a way to request an
interactive shell through Slurm that is associated with the requested
resources. Should we be using something other than srun for this?
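For example, would salloc be more appropriate? My understanding is that
with LaunchParameters=use_interactive_step set in slurm.conf, something
like:
$ salloc --mem=32G
drops you into a shell on the first allocated node, but I don't know
whether it would behave any differently from srun --pty here.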
Thank you,
-Dj
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com