Hey guys,

Since we upgraded to 17.11.7, all jobs on some of our clusters are being killed with these messages:
slurmstepd: error: Job 374 exceeded memory limit (1308 > 1024), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 374 ON node002 CANCELLED AT 2018-06-28T04:40:28 ***

The thing is, DefMemPerCPU and DefMemPerNode are both set to UNLIMITED, and MemLimitEnforce=YES. The users did not set any memory limits for their jobs.

My guess, based on the error messages above, is that DefMemPerCPU is somehow being reset to 1.

I noticed a commit that may be related to this:
https://github.com/SchedMD/slurm/commit/bf4cb0b1b01f3e165bf12e69fe59aa7b222f8d8e

What do you think?

Best regards,
Taras