On 11/7/19 8:36 AM, David Baker wrote:

We are dealing with a weird issue on our shared nodes where jobs appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process, and we do see a lot of these events. In fact, when I started to take a closer look this afternoon I noticed that all jobs on all nodes (not just the shared nodes) are "firing" the oom-killer when they finish.

You should see the reason the OOM killer fired in "dmesg".

Do note, though, that it's not the main job step reporting that; it's the extern step.

If there's nothing there about the OOM killer then the message you see is likely spurious. From memory, Slurm holds a file descriptor on which it receives notifications of OOM killer events, and that count should only increment when the kernel actually reports something on it.
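
For anyone curious, here's roughly what that mechanism looks like on cgroup v1: you open the cgroup's memory.oom_control, create an eventfd, and register the pair via cgroup.event_control; the kernel then bumps the eventfd counter on each OOM in that cgroup. A minimal sketch (the cgroup path, uid and job id below are made up, and error handling is trimmed):

    /* Rough sketch of the cgroup-v1 OOM notification mechanism.
       The cgroup path is hypothetical; real paths depend on your
       cgroup mount and Slurm's job layout. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    int main(void)
    {
        const char *cg = "/sys/fs/cgroup/memory/slurm/uid_1000/job_42";
        char buf[256];

        /* eventfd the kernel will signal on each OOM event */
        int efd = eventfd(0, 0);

        snprintf(buf, sizeof(buf), "%s/memory.oom_control", cg);
        int ofd = open(buf, O_RDONLY);

        /* register "<eventfd> <oom_control fd>" with the cgroup */
        snprintf(buf, sizeof(buf), "%s/cgroup.event_control", cg);
        int cfd = open(buf, O_WRONLY);
        snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
        if (efd < 0 || ofd < 0 || cfd < 0 ||
            write(cfd, buf, strlen(buf)) < 0) {
            perror("oom notification setup");
            return 1;
        }
        close(cfd);

        /* blocks until the kernel records an OOM in this cgroup;
           the value read is the event count since the last read */
        uint64_t events;
        if (read(efd, &events, sizeof(events)) == sizeof(events))
            printf("OOM events: %llu\n", (unsigned long long)events);

        close(ofd);
        close(efd);
        return 0;
    }

In other words, a read on that fd should only succeed when the kernel has actually recorded an OOM in the job's cgroup, which is why a missing dmesg entry suggests the reported count is wrong.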

We're seeing something similar here, but only for the extern step (which seems to be what you're seeing too).

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
