This may be only tangentially relevant, but the scenario presents itself 
similarly. This was not in a scheduler environment, but we have an interactive 
server that would hang in ps on certain tasks (top -bn1 is a way around that, 
BTW, if it's hard even to find out what the process is). For us, it appeared 
to be a process that was using a lot of memory that khugepaged was attempting 
to manipulate.

https://access.redhat.com/solutions/46111

I have never seen this happen on 7.x that I can recall. On our 6.x machine 
where we’ve seen it happen, all we did was this:

echo "madvise" > /sys/kernel/mm/redhat_transparent_hugepage/defrag

…in /etc/rc.local (which I hate, but I’m not sure where else that can go — 
maybe on the boot command line). This prevented nearly 100% of our problems.
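If it helps, here's roughly what ours looks like as an rc.local fragment. The 
path is the RHEL 6 backport location; mainline and 7.x kernels use 
/sys/kernel/mm/transparent_hugepage/defrag instead (just a sketch, adjust for 
your kernel):

```shell
# /etc/rc.local fragment: set THP defrag policy to madvise at boot.
# RHEL 6 exposes the backported THP knob under redhat_transparent_hugepage;
# guard on writability so this is a no-op on kernels without it.
THP_DEFRAG=/sys/kernel/mm/redhat_transparent_hugepage/defrag
if [ -w "$THP_DEFRAG" ]; then
    echo madvise > "$THP_DEFRAG"
fi
```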

No idea if that has anything to do with your situation.

> On Nov 29, 2018, at 1:27 PM, Christopher Benjamin Coffey 
> <chris.cof...@nau.edu> wrote:
> 
> Hi,
> 
> We've been noticing an issue with nodes from time to time that become 
> "wedged", or unusable. This is a state where ps, and w hang. We've been 
> looking into this for a while when we get time and finally put some more 
> effort into it yesterday. We came across this blog which describes almost the 
> exact scenario:
> 
> https://rachelbythebay.com/w/2014/10/27/ps/
> 
> It has nothing to do with Slurm, but it does have to do with cgroups, which 
> we have enabled. It appears that a process that has hit its memory ceiling 
> and should be killed by the oom-killer, but is in D state at the same time, 
> causes the system to become wedged. For each wedged node, I've found a job 
> out in:
> 
> /cgroup/memory/slurm/uid_3665/job_15363106/step_batch
> - memory.max_usage_in_bytes
> - memory.limit_in_bytes
> 
> The two files contain the same byte count, which I'd think would make the 
> job a candidate for the oom-killer. But memory.oom_control says:
> 
> oom_kill_disable 0
> under_oom 0
> 
> My feeling is that the process was in D state, the oom-killer was invoked 
> but couldn't kill it, and the system became wedged.
> 
> Has anyone run into this? If so, what's the fix? Apologies if this has been 
> discussed before; I haven't noticed it on the group.
> 
> I wonder if it's a bug in the oom-killer? Maybe it's been patched in a more 
> recent kernel, but looking at the kernels in the 6.10 series it doesn't look 
> like a newer one would have a patch for an oom-killer bug.
> 
> Our setup is:
> 
> Centos 6.10
> 2.6.32-642.6.2.el6.x86_64
> Slurm 17.11.12
> 
> And /etc/slurm/cgroup.conf
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
> 
> Cheers,
> Chris
> 
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
> 
> 
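For reference, a quick way to spot the "usage pinned at the limit" condition 
described above is to compare the two files directly. This is just a sketch 
using the cgroup path from the report; the at_limit helper is mine, not 
anything from Slurm:

```shell
# Report whether a cgroup's peak memory usage has reached its limit,
# i.e. the state that should have triggered the oom-killer.
at_limit() {
    # $1 = memory.max_usage_in_bytes value, $2 = memory.limit_in_bytes value
    if [ "$1" -ge "$2" ]; then
        echo "AT LIMIT"
    else
        echo "below limit"
    fi
}

# Usage against a job's memory cgroup (path from the message above):
cg=/cgroup/memory/slurm/uid_3665/job_15363106/step_batch
if [ -d "$cg" ]; then
    at_limit "$(cat "$cg/memory.max_usage_in_bytes")" \
             "$(cat "$cg/memory.limit_in_bytes")"
fi
```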

--
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'
