Also consider any cached information, e.g. NFS page cache. You won't necessarily see this yourself, but it might be getting accounted for in the cgroup, depending on your setup/settings.
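Whether cached pages count against the job depends on how memory enforcement is configured on the nodes. As a rough sketch only (the values below are illustrative assumptions, not taken from this thread), the relevant knobs live in cgroup.conf:

    # /etc/slurm/cgroup.conf -- illustrative values only
    CgroupAutomount=yes
    ConstrainRAMSpace=yes     # enforce --mem / --mem-per-cpu via the memory cgroup
    ConstrainSwapSpace=yes    # also count swap usage against the job
    AllowedRAMSpace=100       # percent of the requested memory the job may use

If ConstrainRAMSpace is on, page cache attributed to the job's cgroup can push it over a tight 100M limit even when the program itself stays small.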
-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Loris Bennett
Sent: 14 February 2018 12:06
To: Geert Kapteijns <ghkaptei...@gmail.com>
Cc: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] slurmstepd: error: Exceeded job memory limit at some point.

Geert Kapteijns <ghkaptei...@gmail.com> writes:

> Hi everyone,
>
> I'm running into out-of-memory errors when I specify an array job.
> Needless to say, 100M should be more than enough, and increasing the
> allocated memory to 1G doesn't solve the problem. I call my script as
> follows: sbatch --array=100-199 run_batch_job. run_batch_job contains
>
> #!/bin/env bash
> #SBATCH --partition=lln
> #SBATCH --output=/home/user/outs/%x.out.%a
> #SBATCH --error=/home/user/outs/%x.err.%a
> #SBATCH --cpus-per-task=1
> #SBATCH --mem-per-cpu=100M
> #SBATCH --time=2-00:00:00
>
> srun my_program.out $SLURM_ARRAY_TASK_ID
>
> Instead of using --mem-per-cpu and --cpus-per-task, I've also tried the
> following:
>
> #SBATCH --mem=100M
> #SBATCH --ntasks=1   # Number of cores
> #SBATCH --nodes=1    # All cores on one machine
>
> But in both cases for some of the runs, I get the error:
>
> slurmstepd: error: Exceeded job memory limit at some point.
> srun: error: obelix-cn002: task 0: Out Of Memory
> slurmstepd: error: Exceeded job memory limit at some point.
>
> I've also posted the question on stackoverflow. Does anyone know what is
> happening here?

Maybe once in a while a simulation really does just use more memory than
you were expecting. Have a look at the output of

  sacct -j 123456 -o jobid,maxrss,state --units=M

with the appropriate job ID.

Regards

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de
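For an array job like the one above, a possible follow-up along the same lines (123456 is just a placeholder job ID, as in Loris's example) would be to list the peak memory of every array task, or of a single suspect task:

    # Peak RSS and state for all tasks of the array, in megabytes
    sacct -j 123456 -o jobid,maxrss,maxvmsize,state,exitcode --units=M

    # Or just one array task, e.g. task 142
    sacct -j 123456_142 -o jobid,maxrss,state --units=M

Comparing MaxRSS against the 100M request should show whether some tasks genuinely exceed it or whether the limit is being tripped by something accounted outside the program itself.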