PropagateResourceLimitsExcept won't do it?
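For reference, a minimal slurm.conf sketch. The parameter names are real (see "man slurm.conf"), but the choice of AS here is only an assumption based on this thread's virtual-memory case:

    # Don't copy any resource limits from the submission host into the job:
    PropagateResourceLimits=NONE

    # Alternatively, propagate everything except chosen limits
    # (AS is the address-space / virtual-memory rlimit):
    PropagateResourceLimitsExcept=AS

Set one or the other, not both, then re-read the config (e.g. with "scontrol reconfigure").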
________________________________________
From: Dj Merrill via slurm-users <slurm-users@lists.schedmd.com>
Sent: Wednesday, 15 May 2024 09:43
To: slurm-users@lists.schedmd.com
Subject: [EXTERNAL] [slurm-users] Re: srun weirdness

Thank you Hermann and Tom! That was it. The new cluster has a virtual memory limit on the login host, and the old cluster did not.

It doesn't look like there is any way to set a default that overrides the srun behaviour of passing those resource limits to the shell, so I may consider removing those limits on the login host so folks don't have to manually specify this every time.

I really appreciate the help!

-Dj
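A quick way to confirm which limit differs between the two environments (a sketch, assuming bash on both the login node and in the job):

    # On the login node:
    ulimit -v                  # prints a KiB value if a cap is set, else "unlimited"

    # Inside an allocation on the same cluster:
    srun --mem=32G --pty bash -c 'ulimit -v'

    # With limit propagation active, both commands print the same capped
    # value; with --propagate=NONE (or the slurm.conf change above) the
    # srun shell should match what a direct ssh login to the node reports.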
On 5/15/24 07:20, greent10--- via slurm-users wrote:
> Hi,
>
> When we first migrated to Slurm from PBS, one of the strangest issues we hit was that ulimit settings are inherited from the submission host, which could explain the difference between ssh'ing into the machine (where the default ulimit is applied) and running a job via srun.
>
> You could use:
>
> srun --propagate=NONE --mem=32G --pty bash
>
> I still find Slurm inheriting ulimit and environment variables from the submission host an odd default behaviour.
>
> Tom
>
> --
> Thomas Green, Senior Programmer
> ARCCA, Redwood Building, King Edward VII Avenue, Cardiff, CF10 3NB
> Tel: +44 (0)29 208 79269   Fax: +44 (0)29 208 70734
> Email: green...@cardiff.ac.uk   Web: http://www.cardiff.ac.uk/arcca
>
> -----Original Message-----
> From: Hermann Schwärzler via slurm-users <slurm-users@lists.schedmd.com>
> Sent: Wednesday, May 15, 2024 9:45 AM
> To: slurm-users@lists.schedmd.com
> Subject: [slurm-users] Re: srun weirdness
>
> Hi Dj,
>
> this could be a memory-limits-related problem. What is the output of
>
> ulimit -l -m -v -s
>
> in both interactive job-shells?
>
> You are using cgroups-v1 now, right?
> In that case, what is the respective content of
>
> /sys/fs/cgroup/memory/slurm_*/uid_$(id -u)/job_*/memory.limit_in_bytes
>
> in both shells?
>
> Regards,
> Hermann
>
> On 5/14/24 20:38, Dj Merrill via slurm-users wrote:
>> I'm running into a strange issue and I'm hoping another set of brains looking at this might help. I would appreciate any feedback.
>>
>> I have two Slurm clusters. The first cluster is running Slurm 21.08.8 on Rocky Linux 8.9 machines. The second cluster is running Slurm 23.11.6 on Rocky Linux 9.4 machines.
>>
>> This works perfectly fine on the first cluster:
>>
>> $ srun --mem=32G --pty /bin/bash
>> srun: job 93911 queued and waiting for resources
>> srun: job 93911 has been allocated resources
>>
>> and on the resulting shell on the compute node:
>>
>> $ /mnt/local/ollama/ollama help
>>
>> and the ollama help message appears as expected.
>>
>> However, on the second cluster:
>>
>> $ srun --mem=32G --pty /bin/bash
>> srun: job 3 queued and waiting for resources
>> srun: job 3 has been allocated resources
>>
>> and on the resulting shell on the compute node:
>>
>> $ /mnt/local/ollama/ollama help
>> fatal error: failed to reserve page summary memory
>>
>> runtime stack:
>> runtime.throw({0x1240c66?, 0x154fa39a1008?})
>>     runtime/panic.go:1023 +0x5c fp=0x7ffe6be32648 sp=0x7ffe6be32618 pc=0x4605dc
>> runtime.(*pageAlloc).sysInit(0x127b47e8, 0xf8?)
>>     runtime/mpagealloc_64bit.go:81 +0x11c fp=0x7ffe6be326b8 sp=0x7ffe6be32648 pc=0x456b7c
>> runtime.(*pageAlloc).init(0x127b47e8, 0x127b47e0, 0x128d88f8, 0x0)
>>     runtime/mpagealloc.go:320 +0x85 fp=0x7ffe6be326e8 sp=0x7ffe6be326b8 pc=0x454565
>> runtime.(*mheap).init(0x127b47e0)
>>     runtime/mheap.go:769 +0x165 fp=0x7ffe6be32720 sp=0x7ffe6be326e8 pc=0x451885
>> runtime.mallocinit()
>>     runtime/malloc.go:454 +0xd7 fp=0x7ffe6be32758 sp=0x7ffe6be32720 pc=0x434f97
>> runtime.schedinit()
>>     runtime/proc.go:785 +0xb7 fp=0x7ffe6be327d0 sp=0x7ffe6be32758 pc=0x464397
>> runtime.rt0_go()
>>     runtime/asm_amd64.s:349 +0x11c fp=0x7ffe6be327d8 sp=0x7ffe6be327d0 pc=0x49421c
>>
>> If I ssh directly to the same node on that second cluster (skipping Slurm entirely) and run the same "/mnt/local/ollama/ollama help" command, it works perfectly fine.
>>
>> My first thought was that it might be related to cgroups. I switched the second cluster from cgroups v2 to v1 and tried again; no difference. I tried disabling cgroups on the second cluster by removing all cgroups references in the slurm.conf file, but that also made no difference.
>>
>> My guess is something changed with regards to srun between these two Slurm versions, but I'm not sure what.
>>
>> Any thoughts on what might be happening and/or a way to get this to work on the second cluster? Essentially I need a way to request an interactive shell through Slurm that is associated with the requested resources. Should we be using something other than srun for this?
>>
>> Thank you,
>>
>> -Dj

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
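A footnote for archive readers: the crash above can be reproduced without Slurm by capping the virtual address space before launching a Go binary. A sketch only; the 512 MiB cap is arbitrary and may need tuning, and the ollama path is simply the one from this thread:

    # Go's runtime mmap()s large virtual address-space reservations at
    # startup; a low RLIMIT_AS makes that fail with
    # "fatal error: failed to reserve page summary memory".
    # ulimit -v takes KiB, so 512 * 1024 is roughly a 512 MiB cap.
    # The subshell keeps the limit from affecting the current shell.
    ( ulimit -v $((512 * 1024)); /mnt/local/ollama/ollama help )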