On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote:

> I observed similar symptoms when we had issues with the shared Lustre file system. When the file system couldn't complete an I/O operation, the process in Slurm remained in the CG state until the file system became responsive again. An additional symptom was that the blocking process was stuck in the D state.

We've seen the same behaviour, though in our case we use an "UnkillableStepProgram" to deal with compute nodes where user processes (as opposed to Slurm daemons, which sounds like the issue for the original poster here) get stuck and are unkillable.
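
For reference, wiring that up is just a couple of slurm.conf settings; the path and timeout below are illustrative rather than our actual values:

    # slurm.conf (illustrative values)
    UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh
    UnkillableStepTimeout=120

slurmstepd runs that program on the compute node once it has been unable to clean up the step's processes for UnkillableStepTimeout seconds.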

Our script does things like "echo w > /proc/sysrq-trigger" to get the kernel to dump its view of all stuck processes, then goes through the stuck job's cgroup to find all its processes and dumps /proc/$PID/stack for each process and thread it finds there.
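
In spirit it's something like the sketch below (not our production script; the cgroup v1 freezer layout and the SLURM_JOB_UID / SLURM_JOB_ID environment variables are assumptions you'd need to adapt to your own setup):

    #!/bin/bash
    # Ask the kernel to log all blocked (D state) tasks to the kernel log.
    echo w > /proc/sysrq-trigger

    # Walk the stuck job's cgroup and dump the kernel stack of every
    # process and thread in it (cgroup v1 freezer hierarchy assumed).
    cg="/sys/fs/cgroup/freezer/slurm/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}"
    for pid in $(find "$cg" -name cgroup.procs -exec cat {} \; | sort -un); do
        for task in /proc/"$pid"/task/*; do
            echo "=== PID $pid TID ${task##*/} ($(cat "$task/comm" 2>/dev/null)) ==="
            cat "$task/stack" 2>/dev/null
        done
    done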

In the end it either marks the node down (if it's the only job on the node, which will mark the job as complete in Slurm, though it will not free up those stuck processes) or drains the node if it's running multiple jobs. In both cases we'll come back and check the issue out (and our SREs will wake us up if they think there's an unusual number of these).
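
That last part is just scontrol, something like the following (node name and reason are only illustrative):

    # Only job on the node: take the node out of service entirely.
    scontrol update NodeName=$(hostname -s) State=DOWN Reason="unkillable step, job ${SLURM_JOB_ID}"

    # Other jobs still running: let them finish but accept nothing new.
    scontrol update NodeName=$(hostname -s) State=DRAIN Reason="unkillable step, job ${SLURM_JOB_ID}"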

That final step is important because a node stuck completing can really confuse backfill scheduling for us, as slurmctld assumes it will become free any second now and tries to use the node when planning jobs, despite it being stuck. So marking it down/drain gets it out of slurmctld's view as a potential future node.

For nodes where a Slurm daemon on the node is stuck that script will not fire, so our SREs have alarms that trip after a node has been completing for longer than a certain amount of time. They go and look at what's going on and get the node out of the system before utilisation collapses (and wake us up if that number seems to be increasing).
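
A cheap starting point for that sort of alarm is just polling Slurm for completing jobs and nodes; the threshold and alerting glue are obviously site specific:

    # Jobs sitting in the completing (CG) state, with elapsed time.
    squeue --states=COMPLETING --format="%.18i %.10M %.20j %N"

    # Nodes currently in the completing state.
    sinfo --states=COMPLETING --format="%N %T"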

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA

