Jean-mathieu CHANTREIN <jean-mathieu.chantr...@univ-angers.fr> writes:
> I tried using, in slurm.conf:
>
>   TaskPlugin=task/affinity, task/cgroup
>   SelectTypeParameters=CR_CPU_Memory
>   MemLimitEnforce=yes
>
> and in cgroup.conf:
>
>   CgroupAutomount=yes
>   ConstrainCores=yes
>   ConstrainRAMSpace=yes
>   ConstrainSwapSpace=yes
>   MaxSwapPercent=10
>   TaskAffinity=no

We have a very similar setup, the biggest difference being that we have
MemLimitEnforce=no and leave the killing to the kernel's cgroup. For us,
jobs are killed as they should be. Here are a couple of things you could
check:

- Does it work if you remove the space in
  "TaskPlugin=task/affinity, task/cgroup", i.e.
  TaskPlugin=task/affinity,task/cgroup? (Slurm can be quite picky when
  reading slurm.conf.)

- Look in slurmd.log on the node(s) of the job to see whether the
  cgroup actually gets activated and starts limiting memory for the
  job, or whether there are any cgroup-related errors (see the grep
  example after this list).

- While a job is running, look in the job's cgroup memory directory on
  the compute node (typically
  /sys/fs/cgroup/memory/slurm/uid_<num>/job_<num>). Do the values
  there, for instance memory.limit_in_bytes and
  memory.max_usage_in_bytes, make sense? (A short sketch follows the
  grep example below.)
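For the slurmd.log check, something like this usually turns up the
relevant lines. (The path below is only a common choice, not a
guarantee: the actual location is whatever SlurmdLogFile in your
slurm.conf points to, or syslog if it is unset.)

  # on the compute node
  grep -i cgroup /var/log/slurmd.log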
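And a minimal sketch of the cgroup check, assuming cgroup v1 and a
made-up job 12345 run by uid 1000 that requested 4 GiB (substitute the
real job ID, uid, and memory request):

  # on the compute node, while the job is running
  cd /sys/fs/cgroup/memory/slurm/uid_1000/job_12345

  cat memory.limit_in_bytes
  # expect roughly the job's memory request in bytes,
  # e.g. 4294967296 for a 4 GiB request

  cat memory.max_usage_in_bytes
  # peak usage so far; if this stays far below the limit,
  # the job never actually hits the cgroup limit

  # cross-check against what Slurm thinks the job requested
  scontrol show job 12345 | grep -i mem

If memory.limit_in_bytes shows the node's total RAM or a huge default
value instead of the request, the memory cgroup is not being
constrained, which points at the slurm.conf/cgroup.conf side rather
than at the kernel.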
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo