Hi Jürgen,

you're looking for KillOnBadExit in the slurm.conf:

KillOnBadExit
    If set to 1, a step will be terminated immediately if any task is
    crashed or aborted, as indicated by a non-zero exit code. With the
    default value of 0, if one of the processes is crashed or aborted
    the other processes will continue to run while the crashed or
    aborted process waits. The user can override this configuration
    parameter by using srun's -K, --kill-on-bad-exit.
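
As a rough sketch of what that could look like (the application name ./my_app is only a placeholder, and the rest of your slurm.conf is assumed to stay unchanged):

```
# slurm.conf: make "kill the step on bad exit" the cluster-wide default
KillOnBadExit=1

# Per-step override from the command line, using srun's
# -K / --kill-on-bad-exit option described above:
#   srun --kill-on-bad-exit=1 ./my_app
#
# The value currently in effect should show up in the output of:
#   scontrol show config | grep KillOnBadExit
```

Setting the default in slurm.conf covers every job, while the srun option lets individual users opt in or out per step.
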
This should terminate the job if a step or a process gets oom-killed.

Best,
Marcus

On 19-10-08 10:36, Juergen Salk wrote:
> * Bjørn-Helge Mevik <b.h.me...@usit.uio.no> [191008 08:34]:
> > Jean-mathieu CHANTREIN <jean-mathieu.chantr...@univ-angers.fr> writes:
> >
> > > I tried using, in slurm.conf
> > > TaskPlugin=task/affinity, task/cgroup
> > > SelectTypeParameters=CR_CPU_Memory
> > > MemLimitEnforce=yes
> > >
> > > and in cgroup.conf:
> > > CgroupAutomount=yes
> > > ConstrainCores=yes
> > > ConstrainRAMSpace=yes
> > > ConstrainSwapSpace=yes
> > > MaxSwapPercent=10
> > > TaskAffinity=no
> >
> > We have a very similar setup, the biggest difference being that we have
> > MemLimitEnforce=no, and leave the killing to the kernel's cgroup. For
> > us, jobs are killed as they should. [...]
>
> Hello Bjørn-Helge,
>
> that is interesting. We have a very similar setup as well. However, in
> our Slurm test cluster I have noticed that it is not the *job* that
> gets killed. Instead, the OOM killer terminates one (or more)
> *processes* but keeps the job itself running in a potentially
> unhealthy state.
>
> Is there a way to tell Slurm to terminate the whole job as soon as
> the first OOM kill event takes place during execution?
>
> Best regards
> Jürgen
>
> --
> Jürgen Salk
> Scientific Software & Compute Services (SSCS)
> Kommunikations- und Informationszentrum (kiz)
> Universität Ulm
> Telefon: +49 (0)731 50-22478
> Telefax: +49 (0)731 50-22471
>

--
Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience
Tel.: +49 (0)551 201-2191
E-Mail: mbo...@gwdg.de
---------------------------------------
Gesellschaft fuer wissenschaftliche
Datenverarbeitung mbH Goettingen (GWDG)
Am Fassberg 11, 37077 Goettingen
URL: http://www.gwdg.de
E-Mail: g...@gwdg.de
Tel.: +49 (0)551 201-1510
Fax: +49 (0)551 201-2150
Geschaeftsfuehrer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Christian Griesinger
Sitz der Gesellschaft: Goettingen
Registergericht: Goettingen
Handelsregister-Nr. B 598
---------------------------------------