[slurm-users] Wedged nodes from cgroups, OOM killer, and D state process

Christopher Benjamin Coffey Thu, 29 Nov 2018 10:30:02 -0800

Hi,

We've been noticing an issue where nodes occasionally become "wedged", or 
unusable. In this state, ps and w hang. We've been looking into this on and 
off when we get time, and finally put more effort into it yesterday. We came 
across this blog post, which describes almost the exact scenario:


https://rachelbythebay.com/w/2014/10/27/ps/

It has nothing to do with Slurm, but it does involve cgroups, which we have 
enabled. It appears that a process that has hit its memory ceiling and should 
be killed by the oom-killer, but is in D state at the same time, causes the 
system to become wedged. For each wedged node, I've found a job under:

/cgroup/memory/slurm/uid_3665/job_15363106/step_batch
- memory.max_usage_in_bytes
- memory.limit_in_bytes

The two files contain the same value, which I'd think would make the job a 
candidate for the oom-killer. But memory.oom_control says:

oom_kill_disable 0
under_oom 0
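
The check described above can be scripted so it is easy to repeat on each 
wedged node. What follows is only a minimal sketch in Python, assuming the 
cgroup v1 memory controller layout shown in the path above. The function names 
(read_int, parse_oom_control, check_step) are made up for illustration and are 
not part of Slurm or the kernel.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a single-value cgroup file."""
    return int(Path(path).read_text().strip())


def parse_oom_control(text):
    """Parse 'key value' lines from memory.oom_control into a dict of ints."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            result[parts[0]] = int(parts[1])
    return result


def check_step(cgroup_dir):
    """Summarize the memory state of one cgroup v1 memory directory.

    cgroup_dir is assumed to contain memory.limit_in_bytes,
    memory.max_usage_in_bytes and memory.oom_control.
    """
    d = Path(cgroup_dir)
    limit = read_int(d / "memory.limit_in_bytes")
    peak = read_int(d / "memory.max_usage_in_bytes")
    oom = parse_oom_control((d / "memory.oom_control").read_text())
    return {
        "limit": limit,
        "peak": peak,
        # True when peak usage reached the configured limit
        "hit_limit": peak >= limit,
        "oom_kill_disable": oom.get("oom_kill_disable"),
        "under_oom": oom.get("under_oom"),
    }
```

Calling check_step("/cgroup/memory/slurm/uid_3665/job_15363106/step_batch") 
on an affected node would return a dict in which hit_limit is True while 
oom_kill_disable and under_oom are both 0, which is the combination reported 
here.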

My feeling is that the process was in D state, the kernel tried to invoke the 
oom-killer but the kill never happened, and the system became wedged.
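
Since ps itself hangs on a wedged node, one way to look for the D state 
processes is to read /proc/<pid>/stat directly. The sketch below rests on an 
assumption: that the hang comes from reading files such as 
/proc/<pid>/cmdline, as the linked post describes, and that stat remains 
readable. That has not been verified on these nodes. The helper names 
(parse_stat, d_state_processes) are hypothetical.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the last ')' rather than on whitespace.
    """
    left, right = text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or stat was unreadable or malformed
        if state == "D":
            found.append((pid, comm))
    return found
```

The pids returned by d_state_processes() could then be matched against the 
tasks file of the job's cgroup to see whether the stuck process belongs to the 
job that hit its limit.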

Has anyone run into this? If so, what's the fix? Apologies if this has been 
discussed before; I haven't noticed it on the group.

I wonder if it's a bug in the oom-killer? Maybe it's been patched in a more 
recent kernel, but looking at the kernels in the 6.10 series, it doesn't look 
like a newer one would have a patch for an oom-killer bug.

Our setup is:

CentOS 6.10
2.6.32-642.6.2.el6.x86_64
Slurm 17.11.12

And /etc/slurm/cgroup.conf:
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

Cheers,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
