Hi Sean,

I don't see PrologFlags=Contain in your slurm.conf. It is one of the slurm.conf entries required to activate cgroup containment: https://slurm.schedmd.com/cgroup.conf.html#OPT_/etc/slurm/slurm.conf
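For reference, the change would look something like this (a minimal sketch; the systemctl commands assume a systemd-managed install, and the daemons need restarting on the controller and every compute node):

PrologFlags=Contain

# after editing slurm.conf on all nodes:
# systemctl restart slurmctld    (on neuro01)
# systemctl restart slurmd       (on the compute nodes)

One other thing worth noting: since your test attaches to the node with ssh rather than srun, the ssh-launched processes will only land in the job's cgroup if something adopts them into the "extern" step that PrologFlags=Contain creates; pam_slurm_adopt is the usual mechanism for that.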
Best,
-Sean

On Tue, Nov 8, 2022 at 8:16 AM Sean McGrath <smcg...@tchpc.tcd.ie> wrote:

> Hi,
>
> I can't get cgroups to constrain memory or cores. If anyone is able to
> point out what I am doing wrong I would be very grateful please.
>
> Testing:
>
> Request a core and 2G of memory, log into it and compile a binary that
> just allocates memory quickly:
>
> $ salloc -n 1 --mem=2G
> $ ssh $SLURM_NODELIST
> $ cat stoopid-memory-overallocation.c
> /*
>  * Sometimes you need to over-allocate the memory available to you.
>  * This does so splendidly. I just hope you have limits set to kill it!
>  */
>
> #include <stdlib.h>
> #include <string.h>
>
> int main()
> {
>     while (1)
>     {
>         void *m = malloc(1024*1024);
>         memset(m, 0, 1024*1024);
>     }
>     return 0;
> }
> $ gcc -o stoopid-memory-overallocation.x stoopid-memory-overallocation.c
>
> Checking memory usage before as a baseline:
>
> $ free -g
>               total        used        free      shared  buff/cache   available
> Mem:            251           1         246           0           3         248
> Swap:             7           0           7
>
> Launch the memory over-allocation, check memory use again, and see that
> 34G has been allocated when I expect it to be constrained to 2G:
>
> $ ./stoopid-memory-overallocation.x &
> $ sleep 10 && free -g
>               total        used        free      shared  buff/cache   available
> Mem:            251          34         213           0           3         215
> Swap:             7           0           7
>
> Run another process to check CPU constraints:
>
> $ ./stoopid-memory-overallocation.x &
>
> Check it with top and I can see that the 2 processes are running
> simultaneously:
>
> $ top
> top - 13:04:44 up 13 days, 23:39,  2 users,  load average: 0.63, 0.27, 0.11
> Tasks: 525 total,   3 running, 522 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.7 us,  5.5 sy,  0.0 ni, 93.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> MiB Mem : 257404.1 total, 181300.3 free,  72588.6 used,   3515.2 buff/cache
> MiB Swap:   8192.0 total,   8192.0 free,      0.0 used. 183300.3 avail Mem
>
>     PID USER     PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>  120978 smcgrat  20   0   57.6g  57.6g    968 R 100.0  22.9   0:22.63 stoopid-memory-
>  120981 smcgrat  20   0   11.6g  11.6g    952 R 100.0   4.6   0:04.57 stoopid-memory-
> ...
>
> Is this actually a valid test case or am I doing something else wrong?
>
> Thanks
>
> Sean
>
> Setup details:
>
> Ubuntu 20.04.5 LTS (Focal Fossa).
> slurm 21.08.8-2.
> cgroup-tools version 0.41-10 installed.
>
> The following was set in /etc/default/grub and update-grub run:
>
> GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"
>
> Relevant parts of scontrol show conf:
>
> JobAcctGatherType       = jobacct_gather/none
> ProctrackType           = proctrack/cgroup
> TaskPlugin              = task/cgroup
> TaskPluginParam         = (null type)
>
> The contents of the full slurm.conf:
>
> ClusterName=neuro
> SlurmctldHost=neuro01(192.168.49.254)
> AuthType=auth/munge
> CommunicationParameters=block_null_hash
> CryptoType=crypto/munge
> Epilog=/home/support/slurm/etc/slurm.epilog.local
> EpilogSlurmctld=/home/support/slurm/etc/slurm.epilogslurmctld
> JobRequeue=0
> MaxJobCount=30000
> MpiDefault=none
> Prolog=/home/support/slurm/etc/prolog
> ReturnToService=2
> SlurmctldPidFile=/var/run/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurmd.pid
> SlurmdPort=6818
> SlurmUser=root
> StateSaveLocation=/var/slurm_state/neuro
> SwitchType=switch/none
> TaskPlugin=task/cgroup
> ProctrackType=proctrack/cgroup
> RebootProgram=/sbin/reboot
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=300
> SlurmdTimeout=300
> Waittime=0
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> AccountingStorageHost=service01
> AccountingStorageType=accounting_storage/slurmdbd
> JobCompType=jobcomp/none
> JobAcctGatherFrequency=30
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm.log
> DefMemPerNode=257300
> MaxMemPerNode=257300
> NodeName=neuro-n01-mgt RealMemory=257300 Sockets=2 CoresPerSocket=16 State=UNKNOWN
> NodeName=neuro-n02-mgt RealMemory=257300 Sockets=2 CoresPerSocket=16 State=UNKNOWN
> PartitionName=compute Nodes=ALL Default=YES MaxTime=5760 State=UP Shared=YES
>
> cgroup.conf file contents:
>
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> TaskAffinity=no
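P.S. Once PrologFlags=Contain is in place, you can confirm the containment from inside an allocation. A sketch, assuming cgroup v1 (the default on Ubuntu 20.04); the uid_*/job_* path components vary per user and job:

$ salloc -n 1 --mem=2G
$ srun cat /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/memory.limit_in_bytes

That should report 2147483648 (2G) rather than the node's full memory, and your stoopid-memory-overallocation.x test should then be OOM-killed at around 2G instead of growing to 34G.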