Re: [slurm-users] oom-kill events for no good reason

Marcus Wagner Fri, 08 Nov 2019 05:05:58 -0800

Hi David,

Yes, I see these messages as well. I also think this is more likely a wrong message. If a job has actually been killed by the OOM killer, you can see this with sacct, e.g.:

$> sacct -j 10816098

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
10816098       VASP_MPI       c18m    default         12 OUT_OF_ME+    0:125
10816098.ba+      batch               default         12 OUT_OF_ME+    0:125
10816098.ex+     extern               default         12  COMPLETED      0:0
10816098.0     vasp_mpi               default         12 OUT_OF_ME+    0:125
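The same check can be scripted. What follows is only a minimal sketch, assuming sacct is on the PATH and is called with --parsable2 --noheader --format=JobID,State,ExitCode; the helper names parse_sacct, oom_steps and query are made up for illustration and are not part of Slurm.

```python
import subprocess


def parse_sacct(text):
    """Parse pipe-separated sacct output with the columns JobID|State|ExitCode."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, state, exit_code = line.split("|")[:3]
        rows.append({"job_id": job_id, "state": state, "exit_code": exit_code})
    return rows


def oom_steps(rows):
    """Return the job/step IDs whose state indicates an out-of-memory kill.

    The prefix match also covers the truncated form OUT_OF_ME+ that the
    default column width produces.
    """
    return [r["job_id"] for r in rows if r["state"].startswith("OUT_OF_ME")]


def query(job_id):
    """Run sacct for one job (assumes a working Slurm accounting setup)."""
    result = subprocess.run(
        ["sacct", "-j", str(job_id), "--parsable2", "--noheader",
         "--format=JobID,State,ExitCode"],
        check=True, capture_output=True, text=True,
    )
    return parse_sacct(result.stdout)
```

On a cluster, oom_steps(query(10816098)) would list the steps that accounting recorded as out of memory. An empty list for a job whose slurmd.log shows an oom-kill event would suggest that the log message did not correspond to an actual kill.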


Best
Marcus

On 11/7/19 5:36 PM, David Baker wrote:

Hello,
We are dealing with a weird issue on our shared nodes where jobs appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process. We do see a lot of these events. In fact, when I started to take a closer look this afternoon I noticed that all jobs on all nodes (not just the shared nodes) are "firing" the oom-killer for some reason when they finish.
As a demo I launched a very simple (low memory usage) test job on a shared node and then, after a few minutes, cancelled it to show the behaviour. Looking in the slurmd.log -- see below -- we see the oom-killer being fired for no good reason. This "feels" vaguely similar to this bug -- https://bugs.schedmd.com/show_bug.cgi?id=5121 -- which I understand was patched back in SLURM v17 (we are using v18*).
Has anyone else seen this behaviour? Or, more to the point, does anyone understand this behaviour and know how to squash it, please?
Best regards,
David

[2019-11-07T16:14:52.551] Launching batch job 164978 for UID 57337
[2019-11-07T16:14:52.559] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.560] [164977.batch] task/cgroup: /slurm/uid_57337/job_164977/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.584] [164978.batch] task/cgroup: /slurm/uid_57337/job_164978/step_batch: alloc=23640MB mem.limit=23640MB memsw.limit=unlimited
[2019-11-07T16:14:52.960] [164977.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-11-07T16:14:52.960] [164978.batch] task_p_pre_launch: Using sched_affinity for tasks
[2019-11-07T16:16:05.859] [164977.batch] error: *** JOB 164977 ON gold57 CANCELLED AT 2019-11-07T16:16:05 ***
[2019-11-07T16:16:05.882] [164977.extern] *_oom_event_monitor: oom-kill event count: 1*
[2019-11-07T16:16:05.886] [164977.extern] done with job
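
To see how often this pattern occurs, the log can be scanned for oom-kill events and matched against cancelled jobs. This is only a rough sketch: the regular expressions are derived from the log lines quoted above and may need adjusting for other Slurm versions, and summarize is a made-up helper name.

```python
import re

# Patterns modelled on the slurmd.log excerpt above (an assumption, not a
# documented log format).
OOM_RE = re.compile(
    r"\[(?P<job>\d+)\.(?P<step>\w+)\].*_oom_event_monitor:\s*"
    r"oom-kill event count:\s*(?P<count>\d+)"
)
CANCEL_RE = re.compile(r"\*\*\* JOB (?P<job>\d+) ON \S+ CANCELLED AT")


def summarize(lines):
    """Return (job, step, oom_count, was_cancelled) for each step with oom events."""
    cancelled = set()
    oom_counts = {}
    for line in lines:
        cancel = CANCEL_RE.search(line)
        if cancel:
            cancelled.add(cancel.group("job"))
        oom = OOM_RE.search(line)
        if oom:
            key = (oom.group("job"), oom.group("step"))
            oom_counts[key] = oom_counts.get(key, 0) + int(oom.group("count"))
    return [
        (job, step, count, job in cancelled)
        for (job, step), count in sorted(oom_counts.items())
    ]
```

Calling summarize(open("slurmd.log")) on a node's log (the path depends on the site's SlurmdLogFile setting) gives one tuple per step. For the excerpt above it reports job 164977, step extern, one event, and that the job was cancelled.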


--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de
