Hi everybody,
I would like to give you a quick update on this problem (systems hanging
when swapping occurs because of cgroup memory limits):
We had opened a case with Red Hat's customer support. After some back and
forth they were able to reproduce the problem. Last week they told us to upgrade
to ve
Hello Hermann,
Thanks for following up about this. What you say makes sense: at Lafayette,
we didn't experience the issue until upgrading to a Slurm version that
supported cgroups/v2, and here at Swarthmore we are still on a version of
Slurm that doesn't, and we don't have the issue (both on Rocky 8)
Hi Jason,
thank you for your reply.
From what I can tell, your problem *is* the same as ours. BTW: we were
already talking about disabling swap in our nodes as a last resort. :-)
In the meantime we made some new findings: we can trigger the error when
(with cgroups/v2) we set memory.high and m
Hello,
This isn't precisely related, but I can say that we were having strange
issues with system load spiking to the point that the nodes became
unresponsive and likewise needed a hard reboot. After several tests and
working with our vendor, on nodes where we entirely disabled swap, the
problems c
Dear Slurm users,
after opening our new cluster (62 nodes - 250 GB RAM, 64 cores each -
Rocky Linux 8.6 - Kernel 4.18.0-372.16.1.el8_6.0.1 - Slurm 22.05) for
"friendly user" test operation about 6 weeks ago, we were soon facing
serious problems with nodes that suddenly become unresponsive (so m