[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread Christopher Samuel via slurm-users
On 2/10/25 7:05 am, Michał Kadlof via slurm-users wrote: I observed similar symptoms when we had issues with the shared Lustre file system. When the file system couldn't complete an I/O operation, the process in Slurm remained in the CG state until the file system became responsive again. An a

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
ps -eaf --forest is your friend with Slurm. On Mon, Feb 10, 2025, 12:08 PM Michał Kadlof via slurm-users < slurm-users@lists.schedmd.com> wrote: > I observed similar symptoms when we had issues with the shared Lustre file > system. When the file system couldn't complete an I/O operation, the > pro
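For anyone who wants the parent/child view that ps --forest draws in a script, here is a rough Python sketch of the same idea. It assumes you already have (pid, ppid, command) triples from somewhere (for example from ps -eo pid,ppid,comm); the function name render_forest and the sample rows are made up for illustration and are not part of Slurm or ps.

```python
from collections import defaultdict


def render_forest(rows):
    """Return indented lines showing the parent/child tree.

    rows is an iterable of (pid, ppid, command) triples.
    """
    rows = list(rows)
    known = {pid for pid, _, _ in rows}
    children = defaultdict(list)
    commands = {}
    for pid, ppid, command in rows:
        commands[pid] = command
        children[ppid].append(pid)

    # A root is any process whose parent is not in the listing.
    roots = sorted(pid for pid, ppid, _ in rows if ppid not in known)

    lines = []

    def walk(pid, depth):
        lines.append(f"{'  ' * depth}{pid} {commands[pid]}")
        for child in sorted(children[pid]):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


if __name__ == "__main__":
    # Invented sample data, only to show the shape of the output.
    sample = [
        (1, 0, "systemd"),
        (900, 1, "slurmd"),
        (1500, 1, "slurmstepd"),
        (1501, 1500, "bash"),
        (1502, 1501, "dd"),
    ]
    print("\n".join(render_forest(sample)))
```

On a node with a job stuck in CG, the interesting part of such a tree is usually whatever is still hanging under the job's step daemon.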

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread Michał Kadlof via slurm-users
I observed similar symptoms when we had issues with the shared Lustre file system. When the file system couldn't complete an I/O operation, the process in Slurm remained in the CG state until the file system became responsive again. An additional symptom was that the blocking process was stuck
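A process waiting on an unresponsive file system typically sits in uninterruptible sleep (state D on Linux), so one way to look for the culprit on a node is to scan /proc for that state. The following is only a sketch under that assumption, not something from the original post; the function names parse_stat_line and find_uninterruptible are hypothetical, and it has to run on the compute node itself.

```python
from pathlib import Path


def parse_stat_line(stat_line):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren].strip())
    command = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, command, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    blocked = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            pid, command, state = parse_stat_line((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file was unreadable
        if state == "D":
            blocked.append((pid, command))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, command in find_uninterruptible():
        print(pid, command)
```

A process can pass through state D briefly during normal I/O, so a single hit proves little; a PID that shows up in repeated runs is the one worth checking against the stuck job.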

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
Belay that reply. Different issue. In that case salloc works OK but srun says user has no job on the node. On Mon, Feb 10, 2025, 9:24 AM John Hearns wrote: > I have had something similar. > The fix was to run a > scontrol reconfig > Which causes a reread of the Slurmd config > Give that a try > >

[slurm-users] Re: jobs getting stuck in CG

2025-02-10 Thread John Hearns via slurm-users
I have had something similar. The fix was to run a scontrol reconfig, which causes a reread of the slurmd config. Give that a try. It might be scontrol reread; check the manual. On Mon, Feb 10, 2025, 8:32 AM Ricardo Román-Brenes via slurm-users < slurm-users@lists.schedmd.com> wrote: > Hello everyone.