Belay that reply. Different issue. In that case salloc works OK, but srun says the user has no job on the node.
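If it helps to narrow that down, a quick cross-check (the job id below is a placeholder):

    salloc -N1 --time=10           # grab a small allocation
    squeue -u $USER                # note the job id and the node that was granted
    scontrol show job <jobid>      # verify the NodeList the controller assigned
    srun --jobid=<jobid> hostname  # try to launch a step into that allocation

If srun only fails when targeting that allocation, it suggests the node's slurmd does not know about a job that slurmctld thinks is there; just a guess, not something confirmed in this thread.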
On Mon, Feb 10, 2025, 9:24 AM John Hearns <hear...@gmail.com> wrote:

> I have had something similar.
> The fix was to run a
> scontrol reconfig
> which causes a reread of the slurmd config.
> Give that a try.
>
> It might be scontrol reread. Use the manual.
>
> On Mon, Feb 10, 2025, 8:32 AM Ricardo Román-Brenes via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> Hello everyone.
>>
>> I have a cluster composed of 16 nodes, 4 of them with GPUs and no
>> particular configuration to manage them.
>> The filesystem is Gluster; authentication is via slapd/munge.
>>
>> My problem is that very frequently, at least one job daily, a job gets
>> stuck in CG. I have no idea why this happens. Manually killing the
>> slurmstepd process releases the node, but this is in no way a manageable
>> solution. Has anyone experienced this (and fixed it)?
>>
>> Thank you.
>>
>> -Ricardo
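For the stuck-CG problem itself, a rough checklist; this is only a sketch, <node> and the job id are placeholders, and it assumes the jobs are hanging on an unkillable slurmstepd, which the thread does not confirm:

    squeue -t COMPLETING                    # list jobs stuck in CG and the nodes involved
    scontrol show node <node>               # check the node state and any Reason field
    # on the affected node:
    pgrep -fl slurmstepd                    # find the leftover step daemon for the job
    # last resort, after killing the stepd, cycle the node to clear the CG state:
    scontrol update nodename=<node> state=down reason="stuck in CG"
    scontrol update nodename=<node> state=resume

If this happens daily, UnkillableStepTimeout (and possibly UnkillableStepProgram) in slurm.conf may be worth reviewing as well; again, a suggestion rather than a confirmed fix.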
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com