Hi,

Thanks again for all the suggestions. It turns out that on our cluster we can't
use cgroups because of the old kernel, but setting JobAcctGatherParams=UsePSS
resolved the problem.
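For anyone who runs into the same thing, the relevant part of our slurm.conf now
looks roughly like this (the JobAcctGatherType line reflects our cgroup-free
setup and may well differ on other clusters):

JobAcctGatherType=jobacct_gather/linux   # our setup (no cgroups); yours may differ
JobAcctGatherParams=UsePSS               # PSS accounting, so shared mmaps aren't counted once per process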
Regards,
Sergey

On Fri, 2019-01-11 at 10:37 +0200, Janne Blomqvist wrote:
> On 11/01/2019 08.29, Sergey Koposov wrote:
> > Hi,
> >
> > I've recently migrated to Slurm from PBS on our cluster. Because of that,
> > the job memory limits are now strictly enforced, and that causes my code
> > to get killed.
> > The trick is that my code uses memory mapping (i.e. mmap) of one single
> > large file (~12 GB) in each thread on each node.
> > With this technique, in the past, despite the file being mmapped
> > (read-only) in, say, 16 threads, the actual memory footprint was still
> > ~12 GB. However, when I now do this under Slurm, it thinks that each
> > thread (or process) takes 12 GB and kills my processes.
> >
> > Does anyone have a way around this problem? Other than stopping using
> > memory as a consumable resource, or faking that each node has more memory?
> >
> > Here is an example Slurm script that I'm running:
> >
> > #!/bin/bash
> > #SBATCH -N 1                 # number of nodes
> > #SBATCH --cpus-per-task=10   # number of cores
> > #SBATCH --ntasks-per-node=1
> > #SBATCH --mem=125GB
> > #SBATCH --array=0-4
> >
> > sh script1.sh $SLURM_ARRAY_TASK_ID 5
> >
> > script1.sh essentially starts Python, which in turn creates 10
> > multiprocessing processes, each of which mmaps the large file.
> > ------
> > In this case I'm forced to limit myself to using only 10 threads instead
> > of 16 (our machines have 16 cores) to avoid being killed by Slurm.
> > ---
> > Thanks in advance for any suggestions.
> >
> > Sergey
>
> What is your memory limit configuration in Slurm? Anyway, a few things to
> check:
>
> - Make sure you're not limiting RLIMIT_AS in any way (e.g. run "ulimit -v"
>   in your batch script and ensure it's unlimited; in the Slurm config,
>   ensure VSizeFactor=0).
> - Are you using task/cgroup for limiting memory? In that case the problem
>   might be that cgroup memory limits work with RSS, and since you're running
>   multiple processes, the shared mmapped file will be counted multiple
>   times. There's no really good way around this, but with something like
>
>   ConstrainRAMSpace=no
>   ConstrainSwapSpace=yes
>   AllowedRAMSpace=100
>   AllowedSwapSpace=1600
>
>   you'll get a setup where the cgroup soft limit is set to the amount your
>   job allocates, but the hard limit (where the job is killed) is set to
>   1600% of that.
> - If you're using cgroups for memory limits, you should also set
>   JobAcctGatherParams=NoOverMemoryKill.
> - If you're NOT using cgroups for memory limits, try setting
>   JobAcctGatherParams=UsePSS, which should avoid counting the shared
>   mappings multiple times.
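P.S. In case it helps anyone else debugging the same thing, here is a rough
standalone Python sketch (not our actual pipeline; the file name, worker count
and page-touching loop are just placeholders) that mmaps one file read-only in
several multiprocessing workers and prints each worker's Rss/Pss totals from
/proc/self/smaps. With a shared read-only mapping, the RSS numbers add up to
many times the file size across workers, while the PSS numbers roughly split it
between them, which is why UsePSS accounting stops the jobs from being killed.

#!/usr/bin/env python3
# Hypothetical demo: map one file read-only in several worker processes
# and report each worker's Rss/Pss totals from /proc/self/smaps.
import mmap
import multiprocessing as mp

DATA_FILE = "big_catalog.bin"   # placeholder for the real ~12 GB file

def smaps_totals_kb():
    # Sum the Rss: and Pss: fields over all mappings of this process.
    rss = pss = 0
    with open("/proc/self/smaps") as fh:
        for line in fh:
            if line.startswith("Rss:"):
                rss += int(line.split()[1])
            elif line.startswith("Pss:"):
                pss += int(line.split()[1])
    return rss, pss

def worker(idx):
    with open(DATA_FILE, "rb") as fh:
        mm = mmap.mmap(fh.fileno(), 0, prot=mmap.PROT_READ)
        # Touch one byte per page so the mapping is actually resident.
        checksum = sum(mm[i] for i in range(0, len(mm), 4096))
        rss_kb, pss_kb = smaps_totals_kb()
        print("worker %d: Rss=%d kB  Pss=%d kB  (checksum %d)"
              % (idx, rss_kb, pss_kb, checksum))
        mm.close()

if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()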