Dear Arthur,

Arthur Gilly <arthur.gi...@helmholtz-muenchen.de> writes:
> Dear Slurm users,
>
> I am looking for a Slurm setting that will kill a job immediately when any
> subprocess of that job hits an OOM limit. Several posts have touched upon
> that, e.g.:
> https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04091.html and
> https://www.mail-archive.com/slurm-users@lists.schedmd.com/msg04190.html or
> https://bugs.schedmd.com/show_bug.cgi?id=3216
> but I cannot find an answer that works in our setting.
>
> The two options I have found are:
>
> 1. Set the shebang to #!/bin/bash -e, which we don't want to do, as we'd
>    need to change this for hundreds of scripts from another cluster where we
>    had a different scheduler, AND it would kill tasks for other runtime
>    errors (e.g. if one command in the script doesn't find a file).
>
> 2. Set KillOnBadExit=1. I am puzzled by this one. This is supposed to be
>    overridden by srun's -K option. Using the example below, srun -K --mem=1G
>    ./multalloc.sh would be expected to kill the job at the first OOM. But it
>    doesn't, and happily keeps reporting 3 oom-kill events. So, will this
>    work?
>
> The reason we want this is that we have scripts that execute programs in
> loops. These programs are slow and memory-intensive. When the first one
> crashes from OOM, the next iterations also crash. In the current setup, we
> are wasting days executing loops where every iteration crashes after an hour
> or so due to OOM.

Not an answer to your question, but if your runs are independent, would using
a job array help you here?
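For instance (a sketch only — the script name and array range are made up,
and you would need to restructure multalloc.sh accordingly), each ./alloc8Gb
run could become its own array task with its own 1G limit, so one OOM kill no
longer wastes time on the iterations that follow it:

```shell
#!/bin/bash
#SBATCH --mem=1G
#SBATCH --array=1-3
# singlealloc.sh (hypothetical): one loop iteration per array task.
# Slurm schedules the tasks independently, so an OOM kill in task 1
# does not delay or invalidate tasks 2 and 3.
echo "iteration ${SLURM_ARRAY_TASK_ID}"
./alloc8Gb
```

Submitted once with `sbatch ./singlealloc.sh`, this should show up in sacct
as separate tasks (3130111_1, 3130111_2, ...), so the OOM-killed ones are
easy to spot and resubmit individually.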
Cheers,

Loris

> We are using cgroups (and we want to keep them) with the following config:
>
> CgroupAutomount=yes
> ConstrainCores=yes
> ConstrainDevices=yes
> ConstrainKmemSpace=no
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> MaxSwapPercent=10
> TaskAffinity=no
>
> Relevant bits from slurm.conf:
>
> SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
> SelectType=select/cons_tres
> GresTypes=gpu,mps,bandwidth
>
> Very simple example:
>
> #!/bin/bash
> # multalloc.sh -- each line runs a very simple C++ program that allocates
> # an 8 GB vector and fills it with random floats
> echo one
> ./alloc8Gb
> echo two
> ./alloc8Gb
> echo three
> ./alloc8Gb
> echo done.
>
> This is submitted as follows:
>
> sbatch --mem=1G ./multalloc.sh
>
> The log is:
>
> one
> ./multalloc.sh: line 4: 231155 Killed ./alloc8Gb
> two
> ./multalloc.sh: line 6: 231181 Killed ./alloc8Gb
> three
> ./multalloc.sh: line 8: 231263 Killed ./alloc8Gb
> done.
> slurmstepd: error: Detected 3 oom-kill event(s) in StepId=3130111.batch
> cgroup. Some of your processes may have been killed by the cgroup
> out-of-memory handler.
>
> I am expecting an OOM job kill right before "two".
>
> Any help appreciated.
>
> Best regards,
>
> Arthur
>
> -------------------------------------------------------------
> Dr. Arthur Gilly
> Head of Analytics
> Institute of Translational Genomics
> Helmholtz-Centre Munich (HMGU)
> -------------------------------------------------------------
>
> Helmholtz Zentrum München
> Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
> Ingolstädter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
> Geschäftsführung: Prof. Dr. med. Dr. h.c. Matthias Tschöp, Kerstin Günther
> Registergericht: Amtsgericht München HRB 6466
> USt-IdNr: DE 129521671

--
Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de
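P.S. One more thought on stopping the loop itself: the cgroup OOM handler
kills the child with SIGKILL, so the shell sees exit status 137 (128 + 9). A
small wrapper in the batch script (a sketch, untested on your cluster; the
function name is mine, and note that 137 can also come from other SIGKILL
sources such as scancel) would stop at the first kill without the blanket
effect of bash -e:

```shell
#!/bin/bash
# oom_guard: run a command; if it was killed by SIGKILL (exit status
# 137 = 128 + 9, which is how the cgroup out-of-memory handler kills
# a process), abort the whole batch script with that status. Other
# non-zero exits fall through unchanged, so an ordinary runtime error
# (e.g. a missing input file) does not end the job the way bash -e would.
oom_guard() {
    "$@"
    local rc=$?
    if [ "$rc" -eq 137 ]; then
        echo "oom_guard: '$*' exited 137 (SIGKILL); aborting job" >&2
        exit "$rc"
    fi
    return "$rc"
}

# In multalloc.sh, each line would then become:
#   oom_guard ./alloc8Gb
```

Keying on the exit status rather than on any failure keeps the behaviour
limited to SIGKILLed children, which is exactly the objection to option 1 in
your list.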