Dear Arthur,

Arthur Gilly <> writes:

> Dear Slurm users,
> I am looking for a SLURM setting that will kill a job immediately when any 
> subprocess of that job hits an OOM limit. Several posts have touched upon 
> that, e.g: 
>  and
> or 
> but I cannot find an answer 
> that works in our setting.
> The two options I have found are:
> 1. Set the shebang to #!/bin/bash -e, which we don’t want to do as we’d need
> to change this for hundreds of scripts from another cluster where we had a
> different scheduler, AND it would kill tasks for other runtime errors (e.g.
> if one command in the script doesn’t find a file).
> 2. Set KillOnBadExit=1. I am puzzled by this one. This is supposed to be
> overridden by srun’s -K option. Using the example below, srun -K --mem=1G
> ./ would be expected to kill the job at the first OOM. But it doesn’t, and
> happily keeps reporting 3 oom-kill events. So, will this work?
> The reason we want this is that we have scripts that execute programs in
> loops. These programs are slow and memory intensive. When the first one
> crashes due to OOM, the next iterations also crash. In the current setup, we
> are wasting days executing loops where every iteration crashes after an hour
> or so due to OOM.

Not an answer to your question, but if your runs are independent, would
using a job array help you here?
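
To make the job-array idea concrete, here is a minimal sketch. It assumes
the iterations really are independent; the helper iteration_name and the
three-element list are hypothetical stand-ins for whatever your loops
iterate over. Each array task then runs as its own job with its own memory
limit, so an OOM kill ends only that task.

```bash
#!/bin/bash
#SBATCH --array=1-3
#SBATCH --mem=1G

# Slurm sets SLURM_ARRAY_TASK_ID in each array task. Default to 1 so the
# script can also be tried outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical helper: map the array index to one iteration of the
# original loop (here simply the labels from the example script).
iteration_name() {
    local names=(one two three)
    echo "${names[$(( $1 - 1 ))]}"
}

iteration_name "$task_id"
# ./alloc8Gb    # the per-iteration program from the example would go here
```

The array range would have to match the number of loop iterations.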



> We are using cgroups (and we want to keep them) with the following config:
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainDevices=yes
> ConstrainKmemSpace=no
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> MaxSwapPercent=10
> TaskAffinity=no
> Relevant bits from slurm.conf:
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SelectType=select/cons_tres
> GresTypes=gpu,mps,bandwidth
> Very simple example:
> #!/bin/bash
> # – each line is a very simple cpp program that allocates an 8Gb vector and fills it with random floats
> echo one
> ./alloc8Gb
> echo two
> ./alloc8Gb
> echo three
> ./alloc8Gb
> echo done.
> This is submitted as follows:
> sbatch --mem=1G ./
> The log is :
> one
> ./ line 4: 231155 Killed                  ./alloc8Gb
> two
> ./ line 6: 231181 Killed                  ./alloc8Gb
> three
> ./ line 8: 231263 Killed                  ./alloc8Gb
> done.
> slurmstepd: error: Detected 3 oom-kill event(s) in StepId=3130111.batch 
> cgroup. Some of your processes may have been killed by the cgroup 
> out-of-memory handler.
> I am expecting an OOM job kill right before “two”.
> Any help appreciated.
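
Again not a Slurm setting, but one possible workaround is sketched below.
The "Killed" lines in your log mean the programs were terminated by
SIGKILL, which bash reports as exit status 137 (128 + 9). A small wrapper
can check for that status and stop the script, while letting other
non-zero statuses pass as before. The function name run_or_abort_on_kill
is made up for illustration, and the approach does require touching the
scripts, so it may not suit your case.

```bash
#!/bin/bash
# Hypothetical helper: run a command and stop the whole script if the
# command was terminated by SIGKILL (exit status 137 = 128 + 9), which is
# what the shell sees after a cgroup OOM kill.
run_or_abort_on_kill() {
    local status=0
    "$@" || status=$?
    if [ "$status" -eq 137 ]; then
        echo "'$*' was killed (status 137), aborting" >&2
        exit 137
    fi
    # Any other status is passed back unchanged, so the script carries on
    # exactly as it would without the wrapper.
    return "$status"
}

# Usage inside a job script (commented out so this block only defines
# the helper):
#   echo one
#   run_or_abort_on_kill ./alloc8Gb
#   echo two
#   run_or_abort_on_kill ./alloc8Gb
```

Note that status 137 results from any SIGKILL, not only from the OOM
handler, so a job cancelled in that way would look the same to the script.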
> Best regards,
> Arthur
> -------------------------------------------------------------
> Dr. Arthur Gilly
> Head of Analytics
> Institute of Translational Genomics
> Helmholtz-Centre Munich (HMGU)
> -------------------------------------------------------------
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin