Hello,

We are dealing with a weird issue on our shared nodes where jobs appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process, and we do see a lot of these events. In fact, when I started to take a closer look this afternoon I noticed that all jobs on all nodes (not just the shared nodes) are "firing" the oom-killer for some reason when they finish.
As a demo, I launched a very simple (low memory usage) test job on a shared node and then cancelled it after a few minutes to show the behaviour. Looking in the slurmd.log -- see below -- we see the oom-killer being fired for no good reason. This "feels" vaguely similar to this bug -- https://bugs.schedmd.com/show_bug.cgi?id=5121 -- which I understand was patched back in SLURM v17 (we are using v18*). Has anyone else seen this behaviour? Or, more to the point, does anyone understand this behaviour and know how to squash it, please?

Best regards,
David

[2019-11-07T16:14:52.551] Launching batch job 164978 for UID 57337
[2019-11-07T16:14:52.559] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.560] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.960] [164977.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-11-07T16:14:52.960] [164978.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-11-07T16:16:05.859] [164977.batch] error: *** JOB 164977 ON gold57 CANCELLED AT 2019-11-07T16:16:05 ***
[2019-11-07T16:16:05.882] [164977.extern] _oom_event_monitor: oom-kill event count: 1
[2019-11-07T16:16:05.886] [164977.extern] done with job
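For reference, my (possibly wrong) understanding is that _oom_event_monitor watches the job's memory cgroup through the cgroup v1 memory.oom_control / cgroup.event_control eventfd interface. A minimal sketch of that mechanism, assuming a cgroup v1 memory controller mounted under /sys/fs/cgroup/memory (the cgroup path below is just an illustrative example, not taken from our system), looks roughly like this:

/* Sketch of cgroup v1 OOM notification via eventfd.
 * Assumes cgroup v1; the cgroup path is a made-up example. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    const char *cg = "/sys/fs/cgroup/memory/slurm/uid_57337/job_164977";
    char path[512], line[64];

    int efd = eventfd(0, 0);                    /* notification channel    */
    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    int ofd = open(path, O_RDONLY);             /* file being watched      */
    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    int cfd = open(path, O_WRONLY);             /* registration interface  */
    if (efd < 0 || ofd < 0 || cfd < 0) {
        perror("setup");
        return 1;
    }

    /* Register "<eventfd> <oom_control fd>" so the kernel signals the
     * eventfd whenever an OOM event occurs in this cgroup. */
    snprintf(line, sizeof(line), "%d %d", efd, ofd);
    if (write(cfd, line, strlen(line)) < 0) {
        perror("register");
        return 1;
    }

    /* Block until the eventfd is signalled and read the event count.
     * As far as I can tell, these v1 eventfd notifications can also be
     * delivered when the cgroup itself is torn down, so a non-zero count
     * here does not necessarily mean a task was actually OOM-killed. */
    uint64_t count;
    if (read(efd, &count, sizeof(count)) == sizeof(count))
        printf("oom-kill event count: %llu\n", (unsigned long long)count);

    close(cfd);
    close(ofd);
    close(efd);
    return 0;
}

If that last point is right, it would be consistent with the count of 1 showing up at the moment the cancelled job's cgroup is cleaned up, but I would appreciate confirmation from someone who knows the code.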