Hello,

Thank you all for your useful replies. I double-checked, and the oom-killer does "fire" at the end of every job on our cluster. As you mention, this isn't significant and is nothing to be concerned about.
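For anyone who wants to tell a genuine OOM kill from this harmless end-of-job message, the sacct check described in the quoted reply below can be scripted. The following is only a minimal sketch: it assumes sacct is called with `--parsable2 --noheader --format=JobID,State,ExitCode`, the helper names (`parse_sacct`, `oom_killed`, `run_sacct`) are made up for illustration, and `SAMPLE` is a hand-written parsable rendering of the truncated table in that reply (the full step names and the `OUT_OF_MEMORY` state are assumed expansions of `10816098.ba+`, `10816098.ex+` and `OUT_OF_ME+`).

```python
import subprocess


def parse_sacct(text):
    """Parse pipe-separated sacct output into (job_id, state, exit_code) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, state, exit_code = line.split("|")[:3]
        # A state such as "CANCELLED by 57337" carries a suffix; keep the first word.
        words = state.split()
        rows.append((job_id, words[0] if words else "", exit_code))
    return rows


def oom_killed(rows):
    """Return the job and step IDs whose accounting state is OUT_OF_MEMORY."""
    return [job_id for job_id, state, _ in rows if state == "OUT_OF_MEMORY"]


def run_sacct(job_id):
    """Call sacct for one job (needs a working Slurm accounting setup)."""
    cmd = ["sacct", "-j", str(job_id), "--parsable2", "--noheader",
           "--format=JobID,State,ExitCode"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


# Assumed parsable form of the table in the quoted reply (illustrative only).
SAMPLE = """\
10816098|OUT_OF_MEMORY|0:125
10816098.batch|OUT_OF_MEMORY|0:125
10816098.extern|COMPLETED|0:0
10816098.0|OUT_OF_MEMORY|0:125
"""

if __name__ == "__main__":
    print(oom_killed(parse_sacct(SAMPLE)))
```

On a cluster, `oom_killed(parse_sacct(run_sacct(164977)))` would presumably return an empty list for a job that only shows the end-of-job log message.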
Best regards,
David

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Marcus Wagner <wag...@itc.rwth-aachen.de>
Sent: 08 November 2019 13:00
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] oom-kill events for no good reason

Hi David,

yes, I see these messages as well. I also think this is more likely an erroneous message. If a job has actually been terminated by the OOM killer, you can see this with sacct, e.g.

$> sacct -j 10816098
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10816098       VASP_MPI       c18m    default         12 OUT_OF_ME+    0:125
10816098.ba+      batch               default         12 OUT_OF_ME+    0:125
10816098.ex+     extern               default         12  COMPLETED      0:0
10816098.0     vasp_mpi               default         12 OUT_OF_ME+    0:125

Best
Marcus

On 11/7/19 5:36 PM, David Baker wrote:

Hello,

We are dealing with a weird issue on our shared nodes where jobs appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process. We do see a lot of these events. In fact, when I started to take a closer look this afternoon, I noticed that all jobs on all nodes (not just the shared nodes) are "firing" the oom-killer for some reason when they finish.

As a demo I launched a very simple (low memory usage) test job on a shared node and then, after a few minutes, cancelled it to show the behaviour. Looking in the slurmd.log -- see below -- we see the oom-killer being fired for no good reason. This "feels" vaguely similar to this bug -- https://bugs.schedmd.com/show_bug.cgi?id=5121 -- which I understand was patched back in SLURM v17 (we are using v18*).
Has anyone else seen this behaviour? Or, more to the point, does anyone understand this behaviour and know how to squash it, please?

Best regards,
David

[2019-11-07T16:14:52.551] Launching batch job 164978 for UID 57337
[2019-11-07T16:14:52.559] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.560] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.960] [164977.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-11-07T16:14:52.960] [164978.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-11-07T16:16:05.859] [164977.batch] error: *** JOB 164977 ON gold57 CANCELLED AT 2019-11-07T16:16:05 ***
[2019-11-07T16:16:05.882] [164977.extern] _oom_event_monitor: oom-kill event count: 1
[2019-11-07T16:16:05.886] [164977.extern] done with job

--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de
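
To see how widespread the end-of-job message is on a node, the `_oom_event_monitor` lines in slurmd.log can be pulled out mechanically. This is a rough sketch that assumes the log lines keep the `[timestamp] [jobid.step] message` layout shown in the excerpt above; the function name `find_oom_events` is made up for illustration, and `SAMPLE` reuses three lines from that excerpt.

```python
import re

# Matches lines such as:
# [2019-11-07T16:16:05.882] [164977.extern] _oom_event_monitor: oom-kill event count: 1
OOM_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+\[(?P<step>[^\]]+)\]\s+"
    r"_oom_event_monitor: oom-kill event count: (?P<count>\d+)"
)


def find_oom_events(lines):
    """Return (timestamp, job step, reported count) for each oom-kill log line."""
    events = []
    for line in lines:
        match = OOM_RE.match(line.strip())
        if match:
            events.append((match["ts"], match["step"], int(match["count"])))
    return events


# Three lines copied from the slurmd.log excerpt in this thread.
SAMPLE = [
    "[2019-11-07T16:16:05.859] [164977.batch] error: *** JOB 164977 ON gold57 "
    "CANCELLED AT 2019-11-07T16:16:05 ***",
    "[2019-11-07T16:16:05.882] [164977.extern] _oom_event_monitor: oom-kill event count: 1",
    "[2019-11-07T16:16:05.886] [164977.extern] done with job",
]

if __name__ == "__main__":
    for ts, step, count in find_oom_events(SAMPLE):
        print(f"{ts} {step} oom-kill event count: {count}")
```

For a real log, an open file handle can be passed in place of `SAMPLE`, since the function only iterates over lines; the log path is site-specific and is not assumed here. The job steps found this way could then be compared against the sacct State column to see which, if any, were actually recorded as out of memory.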