[slurm-users] Re: srun weirdness

greent10--- via slurm-users Wed, 15 May 2024 04:22:08 -0700

Hi,

When we first migrated to Slurm from PBS, one of the strangest issues we hit was 
that ulimit settings are inherited from the submission host. That could explain 
the difference between ssh'ing into the machine (where the default ulimits are 
applied) and running a job via srun.
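
The inheritance itself is ordinary process behaviour and can be seen without 
Slurm. A minimal sketch, assuming a Linux shell whose hard virtual-memory limit 
permits a 1 GiB soft limit (the value and the function name 
show_inherited_limit are arbitrary, chosen only for illustration):

```shell
# Hypothetical helper: lower the soft virtual-memory limit inside a subshell,
# then start a child process and let it report the limit it sees.
show_inherited_limit() {
    (
        ulimit -S -v 1048576    # soft limit of 1 GiB, expressed in KiB
        sh -c 'ulimit -S -v'    # the child reports the value it inherited
    )
}

show_inherited_limit
```

The subshell keeps the lowered limit from affecting the calling shell. With 
propagation enabled, srun carries the submission host's limits to the job in 
much the same way a parent shell passes them to a child.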


You could use:

srun --propagate=NONE --mem=32G --pty bash
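
To check whether propagation is what makes the difference, the same limit can 
be printed on the submission host, in a default srun step, and in a step 
started with --propagate=NONE. This is only a sketch: compare_propagation is a 
made-up name, it looks only at the soft virtual-memory limit, and it assumes 
srun can start a short job step with your site's default options.

```shell
# Hypothetical helper: print the soft virtual-memory limit in three contexts.
compare_propagation() {
    if ! command -v srun >/dev/null 2>&1; then
        # Not on a Slurm submission host: nothing to compare.
        echo "srun not found; run this on a Slurm submission host"
        return 0
    fi
    echo "submission host:       $(ulimit -S -v)"
    echo "srun (default):        $(srun sh -c 'ulimit -S -v')"
    echo "srun --propagate=NONE: $(srun --propagate=NONE sh -c 'ulimit -S -v')"
}

compare_propagation
```

If the default srun line matches the submission host while the --propagate=NONE 
line matches what you see over ssh, limit propagation is the likely cause.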

I still find Slurm inheriting ulimit and environment variables from the 
submission host an odd default behaviour.

Tom

--
Thomas Green                         Senior Programmer
ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB
Tel: +44 (0)29 208 79269             Fax: +44 (0)29 208 70734
Email: green...@cardiff.ac.uk        Web: http://www.cardiff.ac.uk/arcca

-----Original Message-----
From: Hermann Schwärzler via slurm-users <slurm-users@lists.schedmd.com> 
Sent: Wednesday, May 15, 2024 9:45 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: srun weirdness

Hi Dj,

this could be a problem related to memory limits. What is the output of

  ulimit -l -m -v -s

in both interactive job-shells?

You are using cgroups-v1 now, right?
In that case what is the respective content of

  /sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes

in both shells?
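
To make the two shells easier to compare, both checks can be collected in one 
small function and its output diffed. A rough sketch only: report_limits is a 
made-up name, and the cgroup path is the cgroups-v1 layout mentioned above, 
which may not exist on a given node.

```shell
# Hypothetical helper: print the memory-related ulimits and, if present,
# the cgroups-v1 memory limit of the current user's Slurm job(s).
report_limits() {
    echo "node: $(uname -n)"
    echo "max locked memory (ulimit -l): $(ulimit -l)"
    echo "max resident set  (ulimit -m): $(ulimit -m)"
    echo "virtual memory    (ulimit -v): $(ulimit -v)"
    echo "stack size        (ulimit -s): $(ulimit -s)"
    # If the glob matches nothing it stays literal and fails the -r test.
    for f in /sys/fs/cgroup/memory/slurm_*/uid_"$(id -u)"/job_*/memory.limit_in_bytes; do
        if [ -r "$f" ]; then
            echo "$f: $(cat "$f")"
        fi
    done
}

report_limits
```

Run it once in the srun shell and once in the ssh shell on the same node, then 
compare the two outputs line by line.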

Regards,
Hermann


On 5/14/24 20:38, Dj Merrill via slurm-users wrote:
> I'm running into a strange issue and I'm hoping another set of brains 
> looking at this might help.  I would appreciate any feedback.
>
> I have two Slurm Clusters.  The first cluster is running Slurm 21.08.8 
> on Rocky Linux 8.9 machines.  The second cluster is running Slurm
> 23.11.6 on Rocky Linux 9.4 machines.
>
> This works perfectly fine on the first cluster:
>
> $ srun --mem=32G --pty /bin/bash
>
> srun: job 93911 queued and waiting for resources
> srun: job 93911 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
>
> and the ollama help message appears as expected.
>
> However, on the second cluster:
>
> $ srun --mem=32G --pty /bin/bash
> srun: job 3 queued and waiting for resources
> srun: job 3 has been allocated resources
>
> and on the resulting shell on the compute node:
>
> $ /mnt/local/ollama/ollama help
> fatal error: failed to reserve page summary memory
>
> runtime stack:
> runtime.throw({0x1240c66?, 0x154fa39a1008?})
>      runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
> runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
>      runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
>      runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
> runtime.(*mheap).init(0x127b47e0)
>      runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
> runtime.mallocinit()
>      runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
> runtime.schedinit()
>      runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
> runtime.rt0_go()
>      runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
>
>
> If I ssh directly to the same node on that second cluster (skipping 
> Slurm entirely), and run the same "/mnt/local/ollama/ollama help"
> command, it works perfectly fine.
>
>
> My first thought was that it might be related to cgroups.  I switched 
> the second cluster from cgroups v2 to v1 and tried again, no 
> difference.  I tried disabling cgroups on the second cluster by 
> removing all cgroups references in the slurm.conf file but that also 
> made no difference.
>
>
> My guess is something changed with regards to srun between these two 
> Slurm versions, but I'm not sure what.
>
> Any thoughts on what might be happening and/or a way to get this to 
> work on the second cluster?  Essentially I need a way to request an 
> interactive shell through Slurm that is associated with the requested 
> resources.  Should we be using something other than srun for this?
>
>
> Thank you,
>
> -Dj
>
>
>

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
