> On 19-10-08 10:36, Juergen Salk wrote:
> > * Bjørn-Helge Mevik <b.h.me...@usit.uio.no> [191008 08:34]:
> > > Jean-mathieu CHANTREIN <jean-mathieu.chantr...@univ-angers.fr> writes:
> > >
> > > > I tried using, in slurm.conf
> > > > TaskPlugin=task/affinity, task/cgroup
> > > > SelectTypeParameters=CR_CPU_Memory
> > > > MemLimitEnforce=yes
> > > >
> > > > and in cgroup.conf:
> > > > CgroupAutomount=yes
> > > > ConstrainCores=yes
> > > > ConstrainRAMSpace=yes
> > > > ConstrainSwapSpace=yes
> > > > MaxSwapPercent=10
> > > > TaskAffinity=no
> > >
> > > We have a very similar setup, the biggest difference being that we have
> > > MemLimitEnforce=no, and leave the killing to the kernel's cgroup. For
> > > us, jobs are killed as they should. [...]
> >
> > that is interesting. We have a very similar setup as well. However, in
> > our Slurm test cluster I have noticed that it is not the *job* that
> > gets killed. Instead, the OOM killer terminates one (or more)
> > *processes* but keeps the job itself running in a potentially
> > unhealthy state.
> >
> > Is there a way to tell Slurm to terminate the whole job as soon as
> > the first OOM kill event takes place during execution?
* Marcus Boden <mbo...@gwdg.de> [191008 10:46]:
>
> you're looking for KillOnBadExit in the slurm.conf:
> KillOnBadExit
>
> If set to 1, a step will be terminated immediately if any task
> is crashed or aborted, as indicated by a non-zero exit code.
> With the default value of 0, if one of the processes is crashed
> or aborted the other processes will continue to run while the
> crashed or aborted process waits. The user can override this
> configuration parameter by using srun's -K, --kill-on-bad-exit.
>
> this should terminate the job if a step or a process gets oom-killed.

Hi Marcus,

thank you. I had not considered `KillOnBadExit=1` so far. It does
indeed seem to kill the current job step when it hits the memory
limit, but the batch script then happily proceeds with the next step.

I've also noticed that, for this to work as described above, all
processes must be launched via srun from within the batch script.
Right?

Admittedly, I am also somewhat wary of potential side effects of
setting `KillOnBadExit=1` in a production environment that needs to
cope with all sorts of batch scripts. A non-zero exit code of some
process may or may not harm the batch job, whereas processes that get
oom-killed most probably affect the job as a whole.

Is `KillOnBadExit=1` commonly used?

Thanks again.

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471
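
For illustration, here is a minimal sketch of how a batch script could stop at the first failed step instead of proceeding with the next one. It is an assumption about one possible workaround, not a Slurm feature: `run_step` and `LAUNCHER` are hypothetical names introduced here, and the sketch relies on srun returning a non-zero exit code when a task fails or is killed by a signal (an OOM kill would typically show up as 128 + 9 = 137).

```shell
#!/bin/sh
# Hypothetical helper for a batch script: run one job step and abort the
# whole script if that step fails. Names are illustrative only.

# Launcher used for every step; assumed to be srun on a Slurm cluster.
LAUNCHER="${LAUNCHER:-srun}"

run_step() {
    # Launch the step through the launcher and remember its exit code.
    $LAUNCHER "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        # A task killed by SIGKILL (as the OOM killer does) is assumed
        # to be reported as 128 + 9 = 137.
        echo "step '$*' failed with exit code $rc, aborting job" >&2
        exit "$rc"
    fi
}

# Usage inside the batch script (step names are placeholders):
#   run_step ./step1
#   run_step ./step2
```

Note that this only covers processes launched through the helper; anything started directly in the batch script, outside of srun, is not checked by it.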