This may only be partly relevant, but the scenario presents itself similarly. It's not a scheduler environment; we have an interactive server that would have ps hangs on certain tasks (top -bn1 is a way around that, BTW, if it's hard to even find out what the process is). For us, the culprit appeared to be a process using a lot of memory that khugepaged was attempting to manipulate.
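As an aside, this is roughly what we poke at when ps and w hang. Treat it as a sketch rather than a recipe; the D-state scan in particular assumes command names without spaces:

# top in batch mode shows the short command name and does not read
# /proc/<pid>/cmdline (the read that typically blocks when a task is
# stuck holding mmap_sem), so it usually still returns when ps hangs:
top -bn1 | head -40

# rough scan for tasks in uninterruptible sleep (D state); field 3 of
# /proc/<pid>/stat is the state, field 2 the command name:
awk '$3 == "D" {print $1, $2}' /proc/[0-9]*/stat 2>/dev/null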
https://access.redhat.com/solutions/46111

I have never seen this happen on 7.x that I can recall. On our 6.x machine, where we have seen it happen, all we did was put this:

echo "madvise" > /sys/kernel/mm/redhat_transparent_hugepage/defrag

...in /etc/rc.local (which I hate, but I'm not sure where else that can go; maybe on the boot command line). This prevented nearly 100% of our problems. No idea if that has anything to do with your situation. (A slightly fuller sketch of checking and applying this is below, after the quoted message.)

> On Nov 29, 2018, at 1:27 PM, Christopher Benjamin Coffey <chris.cof...@nau.edu> wrote:
>
> Hi,
>
> We've been noticing an issue with nodes from time to time that become "wedged", or unusable. This is a state where ps and w hang. We've been looking into this for a while when we get time and finally put some more effort into it yesterday. We came across this blog, which describes almost the exact scenario:
>
> https://rachelbythebay.com/w/2014/10/27/ps/
>
> It has nothing to do with Slurm, but it does have to do with cgroups, which we have enabled. It appears that processes that have hit their memory ceiling, should be killed by the oom-killer, and are in D state at the same time cause the system to become wedged. For each wedged node, I've found a job out in:
>
> /cgroup/memory/slurm/uid_3665/job_15363106/step_batch
>   - memory.max_usage_in_bytes
>   - memory.limit_in_bytes
>
> The two files show the same number of bytes, which I'd think would make the job a candidate for the oom-killer. But memory.oom_control says:
>
> oom_kill_disable 0
> under_oom 0
>
> My feeling is that the process was in D state, the oom-killer tried to be invoked but then didn't, and the system became wedged.
>
> Has anyone run into this? If so, what's the fix? Apologies if this has been discussed before; I haven't noticed it on the group.
>
> I wonder if it's a bug in the oom-killer? Maybe it's been patched in a more recent kernel, but looking at the kernels in the 6.10 series it doesn't look like a newer one would have a patch for an oom-killer bug.
>
> Our setup is:
>
> CentOS 6.10
> 2.6.32-642.6.2.el6.x86_64
> Slurm 17.11.12
>
> And /etc/slurm/cgroup.conf:
> ConstrainCores=yes
> ConstrainRAMSpace=yes
> ConstrainSwapSpace=yes
>
> Cheers,
> Chris
>
> --
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167

--
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'
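To expand on the rc.local line above, a minimal sketch of checking and applying the setting. Note the redhat_transparent_hugepage path is specific to the RHEL/CentOS 6 kernel; mainline and EL7+ kernels expose the equivalent knob at /sys/kernel/mm/transparent_hugepage/defrag:

# show the current defrag policy; the active value appears in [brackets]
cat /sys/kernel/mm/redhat_transparent_hugepage/defrag

# restrict THP defrag to regions that opt in via madvise(MADV_HUGEPAGE);
# this is the same line carried in /etc/rc.local above
echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/defrag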