On 11/7/19 8:36 AM, David Baker wrote:

We are dealing with a weird issue on our shared nodes where jobs appear to be stalling for some reason. I was advised that this issue might be related to the oom-killer process, and we do see a lot of these events. In fact, when I started to take a closer look this afternoon I noticed that all jobs on all nodes (not just the shared nodes) are "firing" the oom-killer when they finish.

You should see the reason the OOM killer fired in "dmesg".

Do note, though, that it's not the main job step reporting that; it's the extern step.

If there's nothing there about the OOM killer then the message you see is likely spurious. From memory, Slurm holds a file descriptor on which it receives notifications of OOM killer events, and that count should only increment when the kernel actually reports something on it.
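
For anyone curious, here's roughly what that mechanism looks like on cgroup v1: you open the cgroup's memory.oom_control, create an eventfd, and register the pair via cgroup.event_control; the kernel then bumps the eventfd counter on each OOM in that cgroup. A minimal sketch (the cgroup path, uid and job id below are made up, and error handling is trimmed):

    /* Rough sketch of the cgroup-v1 OOM notification mechanism.
       The cgroup path is hypothetical; real paths depend on your
       cgroup mount and Slurm's job layout. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/eventfd.h>

    int main(void)
    {
        const char *cg = "/sys/fs/cgroup/memory/slurm/uid_1000/job_42";
        char buf[256];

        /* eventfd the kernel will signal on each OOM event */
        int efd = eventfd(0, 0);

        snprintf(buf, sizeof(buf), "%s/memory.oom_control", cg);
        int ofd = open(buf, O_RDONLY);

        /* register "<eventfd> <oom_control fd>" with the cgroup */
        snprintf(buf, sizeof(buf), "%s/cgroup.event_control", cg);
        int cfd = open(buf, O_WRONLY);
        snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
        if (efd < 0 || ofd < 0 || cfd < 0 ||
            write(cfd, buf, strlen(buf)) < 0) {
            perror("oom notification setup");
            return 1;
        }
        close(cfd);

        /* blocks until the kernel records an OOM in this cgroup;
           the value read is the event count since the last read */
        uint64_t events;
        if (read(efd, &events, sizeof(events)) == sizeof(events))
            printf("OOM events: %llu\n", (unsigned long long)events);

        close(ofd);
        close(efd);
        return 0;
    }

In other words, a read on that fd should only succeed when the kernel has actually recorded an OOM in the job's cgroup, which is why a missing dmesg entry suggests the reported count is wrong.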

We're seeing something similar here, but only for the extern step (which seems to be what you're seeing too).

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
