I have had something similar. The fix was to run

scontrol reconfig

which causes slurmd to reread its configuration. Give that a try.

(The full name of the subcommand is scontrol reconfigure; see the scontrol man page for details.)
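
A rough sketch of what that looks like, assuming you run it as root (or the SlurmUser) on the controller node; NODENAME below is just a placeholder:

    # "reconfig" is the short form of "reconfigure"; it tells slurmctld and
    # the slurmd daemons to reread slurm.conf without restarting them
    scontrol reconfigure

    # check whether the stuck jobs have left the completing (CG) state
    squeue --states=CG

    # inspect the node that was hung in completing
    scontrol show node NODENAME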

On Mon, Feb 10, 2025, 8:32 AM Ricardo Román-Brenes via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> Hello everyone.
>
> I have a cluster composed of 16 nodes, 4 of which have GPUs; there is no
> particular configuration in place to manage them.
> The filesystem is gluster, authentication via slapd/munge.
>
> My problem is that very frequently, at least one job a day gets stuck in
> CG. I have no idea why this happens. Manually killing the slurmstepd
> process releases the node, but this is in no way a manageable solution.
> Has anyone experienced this (and fixed it)?
>
> Thank you.
>
> -Ricardo
>
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
