Re: [slurm-users] Troubles with cgroups

2023-05-17 Thread Hermann Schwärzler
Hi everybody, I would like to give you a quick update on this problem (systems hanging when swapping occurs due to cgroup memory limits): we opened a case with Red Hat's customer support. After some back and forth they were able to reproduce the problem. Last week they told us to upgrade to ve

Re: [slurm-users] Troubles with cgroups

2023-03-21 Thread Jason Simms
Hello Hermann, Thanks for following up about this. What you say makes sense: at Lafayette, we didn't experience the issue until we upgraded to a Slurm version that supported cgroups/v2, and here at Swarthmore, we are still on a version of Slurm that doesn't, and we don't have the issue (both Rocky 8)

Re: [slurm-users] Troubles with cgroups

2023-03-21 Thread Hermann Schwärzler
Hi Jason, thank you for your reply. From what I can tell your problem *is* the same as ours. BTW: we were already talking about disabling swap on our nodes as a last resort. :-) In the meantime we have made some new findings: we can trigger the error when (with cgroups/v2) we set memory.high and m

Re: [slurm-users] Troubles with cgroups

2023-03-17 Thread Jason Simms
Hello, This isn't precisely related, but I can say that we were having strange issues with system load spiking to the point that the nodes became unresponsive and likewise needed a hard reboot. After several tests and working with our vendor, on nodes where we entirely disabled swap, the problems c

[slurm-users] Troubles with cgroups

2023-03-16 Thread Hermann Schwärzler
Dear Slurm users, after opening our new cluster (62 nodes - 250 GB RAM, 64 cores each - Rocky Linux 8.6 - Kernel 4.18.0-372.16.1.el8_6.0.1 - Slurm 22.05) for "friendly user" test operation about 6 weeks ago, we soon faced serious problems with nodes that suddenly became unresponsive (so m