Dear Arthur,

Arthur Gilly <arthur.gi...@helmholtz-muenchen.de> writes:
> Dear Slurm users,
>
> I am looking for a Slurm setting that will kill a job immediately when any
> subprocess of that job hits an OOM limit. Several posts have touched upon
> that, e.g.:
> https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04091.html and
> https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04190.html or
> https://bugs.schedmd.com/show_bug.cgi?id=3216
> but I cannot find an answer that works in our setting.
>
> The two options I have found are:
>
> 1. Set the shebang to #!/bin/bash -e, which we don't want to do, as we'd
>    need to change this for hundreds of scripts from another cluster where we
>    had a different scheduler, AND it would kill tasks for other runtime
>    errors (e.g. if one command in the script doesn't find a file).
>
> 2. Set KillOnBadExit=1. I am puzzled by this one. This is supposed to be
>    overridden by srun's -K option. Using the example below, srun -K --mem=1G
>    ./multalloc.sh would be expected to kill the job at the first OOM. But it
>    doesn't, and happily keeps reporting 3 oom-kill events. So, will this
>    work?
>
> The reason we want this is that we have scripts that execute programs in
> loops. These programs are slow and memory-intensive. When the first one
> crashes from OOM, the next iterations also crash. In the current setup, we
> are wasting days executing loops where every iteration crashes after an hour
> or so due to OOM.

Not an answer to your question, but if your runs are independent, would using
a job array help you here?
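For instance (a sketch only — the script name and array range are made up,
and you would need to restructure multalloc.sh accordingly), each ./alloc8Gb
run could become its own array task with its own 1G limit, so one OOM kill no
longer wastes time on the iterations that follow it:

```shell
#!/bin/bash
#SBATCH --mem=1G
#SBATCH --array=1-3
# singlealloc.sh (hypothetical): one loop iteration per array task.
# Slurm schedules the tasks independently, so an OOM kill in task 1
# does not delay or invalidate tasks 2 and 3.
echo "iteration ${SLURM_ARRAY_TASK_ID}"
./alloc8Gb
```

Submitted once with `sbatch ./singlealloc.sh`, this should show up in sacct
as separate tasks (3130111_1, 3130111_2, ...), so the OOM-killed ones are
easy to spot and resubmit individually.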
Cheers,

Loris

> We are using cgroups (and we want to keep them) with the following config:
>
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainDevices=yes
> ConstrainKmemSpace=no
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> MaxSwapPercent=10
> TaskAffinity=no
>
> Relevant bits from slurm.conf:
>
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SelectType=select/cons_tres
> GresTypes=gpu,mps,bandwidth
>
> Very simple example:
>
> #!/bin/bash
> # multalloc.sh -- each line runs a very simple C++ program that allocates
> # an 8 GB vector and fills it with random floats
> echo one
> ./alloc8Gb
> echo two
> ./alloc8Gb
> echo three
> ./alloc8Gb
> echo done.
>
> This is submitted as follows:
>
> sbatch --mem=1G ./multalloc.sh
>
> The log is:
>
> one
> ./multalloc.sh: line 4: 231155 Killed ./alloc8Gb
> two
> ./multalloc.sh: line 6: 231181 Killed ./alloc8Gb
> three
> ./multalloc.sh: line 8: 231263 Killed ./alloc8Gb
> done.
> slurmstepd: error: Detected 3 oom-kill event(s) in StepId=3130111.batch
> cgroup. Some of your processes may have been killed by the cgroup
> out-of-memory handler.
>
> I am expecting an OOM job kill right before "two".
>
> Any help appreciated.
>
> Best regards,
>
> Arthur
>
> -------------------------------------------------------------
> Dr. Arthur Gilly
> Head of Analytics
> Institute of Translational Genomics
> Helmholtz-Centre Munich (HMGU)
> -------------------------------------------------------------
>
> Helmholtz Zentrum München
> Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
> Ingolstädter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
> Geschäftsführung: Prof. Dr. med. Dr. h.c. Matthias Tschöp, Kerstin Günther
> Registergericht: Amtsgericht München HRB 6466
> USt-IdNr: DE 129521671

--
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de
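P.S. One more thought on stopping the loop itself: the cgroup OOM handler
kills the child with SIGKILL, so the shell sees exit status 137 (128 + 9). A
small wrapper in the batch script (a sketch, untested on your cluster; the
function name is mine, and note that 137 can also come from other SIGKILL
sources such as scancel) would stop at the first kill without the blanket
effect of bash -e:

```shell
#!/bin/bash
# oom_guard: run a command; if it was killed by SIGKILL (exit status
# 137 = 128 + 9, which is how the cgroup out-of-memory handler kills
# a process), abort the whole batch script with that status. Other
# non-zero exits fall through unchanged, so an ordinary runtime error
# (e.g. a missing input file) does not end the job the way bash -e would.
oom_guard() {
    "$@"
    local rc=$?
    if [ "$rc" -eq 137 ]; then
        echo "oom_guard: '$*' exited 137 (SIGKILL); aborting job" >&2
        exit "$rc"
    fi
    return "$rc"
}

# In multalloc.sh, each line would then become:
#   oom_guard ./alloc8Gb
```

Keying on the exit status rather than on any failure keeps the behaviour
limited to SIGKILLed children, which is exactly the objection to option 1 in
your list.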