On 2018-09-07 18:53, Mike Cammilleri wrote:
Hi everyone,
I'm getting this error lately for everyone's jobs, which results in memory not
being constrained via the cgroups plugin.
slurmstepd: error: task/cgroup: unable to add task[pid=21681] to memory cg
slurmstepd: error: jobacct_gather/cgroup: unable to instanciate user 3691 memory cgroup
The result is that no uid_ directories are created under /sys/fs/cgroup/memory
Here is our cgroup.conf file:
We are using jobacct_gather/cgroup
The partition is configured like this
PartitionName=long Nodes=marzano[05-13] PriorityTier=30 Default=NO MaxTime=5-0
State=UP OverSubscribe=FORCE:1
We are using slurm 16.05.6 on Ubuntu 14.04 LTS
Any ideas how to get cgroups going again?
This is, apparently, a bug in the Linux kernel where it doesn't garbage
collect deleted memory cgroups. Eventually the kernel hits an internal
limit on how many memory cgroups there can be, and refuses to create more.
This bug has apparently been fixed in the upstream kernel, but is still
present at least in the CentOS 7 kernel, and based on your report, in
the Ubuntu 14.04 kernel.
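
If you want to see how many memory cgroups the kernel thinks it has on an affected node, a rough sketch like the one below reads the num_cgroups column for the memory controller from /proc/cgroups. The function names are made up for illustration. I am also assuming that this count includes deleted-but-not-yet-freed cgroups, which may depend on the kernel version, so treat a large number as a hint rather than proof.

```python
def memory_cgroup_count(proc_cgroups_text):
    """Return the num_cgroups value for the memory controller, or None.

    Expects text in the /proc/cgroups layout:
    subsys_name  hierarchy  num_cgroups  enabled
    """
    for line in proc_cgroups_text.splitlines():
        # Skip the header line (starts with '#') and blank lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "memory":
            return int(fields[2])
    return None


def read_memory_cgroup_count(path="/proc/cgroups"):
    """Hypothetical helper: read the file on the node and parse it."""
    with open(path) as f:
        return memory_cgroup_count(f.read())
```

Calling read_memory_cgroup_count() on a compute node and comparing the result with the number of directories actually visible under /sys/fs/cgroup/memory may show whether the two have drifted far apart.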
One workaround is to reboot the node whenever this happens. Another is
to set ConstrainKmemSpace=no in cgroup.conf (but AFAICS this option was
added in slurm 17.02 and is not present in the 16.05 that you're using).
For more information, see discussion and links in slurm bug #5082.
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi