On 10/25/2018 07:00 AM, Christopher Samuel wrote:
On 25/10/18 2:29 pm, Christopher Samuel wrote:
Could explain why this isn't something we see consistently, and why
we're both seeing it currently.
This seems to be a handy way to find any processes that are not properly
constrained by Slurm cgroups on compute nodes (at least in our
configuration):
ps --no-headers -eo pid,user,comm,cgroup | egrep -vw \
'root|freezer:/slurm.*devices:/slurm.*cpuacct,cpu:/slurm.*memory:/slurm|cpuset:/slurm.*|dbus-daemon|munged|ntpd|gmond|polkitd'
Nice command, Chris! I added a few CentOS 7 system usernames (chrony,
smmsp, rpcuser, rpc) to the pattern, as seen below. However, defunct
processes seem to escape the cgroups, for example:
# ps --no-headers -eo pid,user,comm,cgroup | egrep -vw \
'root|freezer:/slurm.*devices:/slurm.*cpuacct,cpu:/slurm.*memory:/slurm|cpuset:/slurm.*|dbus-daemon|munged|ntpd|gmond|polkitd|chrony|smmsp|rpcuser|rpc'
27312 jhwa mpiex <defunct> -
What should we do about defunct processes and cgroups?
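One possible approach, sketched under assumptions: a defunct (zombie)
process has already exited and holds no CPU or memory, only a
process-table entry until its parent reaps it, so it may be reasonable to
leave zombies out of the report. The sketch below adds the stat column to
the ps output and filters with awk instead of egrep. The function name
"unconstrained" and the exempt-name argument are hypothetical, and the
test for "/slurm" in the cgroup column is deliberately looser than the
pattern above, which requires the freezer, devices, cpuacct and memory
controllers to all be present.

```shell
# Hypothetical helper: reads "pid stat user comm cgroup" lines on stdin and
# prints the processes that do not look constrained by a Slurm cgroup.
# $1 is an extended regex of exempt user or command names (an assumption).
unconstrained() {
    awk -v skip="^($1)\$" '
        $2 ~ /^Z/       { next }   # zombie: already exited, nothing left to constrain
        $3 ~ skip       { next }   # exempt user name
        $4 ~ skip       { next }   # exempt command name (first word of comm)
        $NF ~ /\/slurm/ { next }   # last column (cgroup) mentions a slurm cgroup
        { print }
    '
}

# Intended usage on a compute node (not run here):
#   ps --no-headers -eo pid,stat,user,comm,cgroup |
#       unconstrained 'root|dbus-daemon|munged|ntpd|gmond|polkitd|chrony|smmsp|rpcuser|rpc'
```

With this filter a line such as the defunct mpiex process above would be
dropped because its state starts with Z, while a live user process with
"-" in the cgroup column would still be reported.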
/Ole