[slurm-users] Re: jobs getting stuck in CG

Michał Kadlof via slurm-users Mon, 10 Feb 2025 04:06:42 -0800

I observed similar symptoms when we had problems with our shared Lustre file system. When the file system could not complete an I/O operation, the job stayed in the CG (completing) state in Slurm until the file system became responsive again. An additional symptom was that the blocking process was stuck in the D (uninterruptible sleep) state.
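
To check for that second symptom on a compute node, something along these lines may help. It is only a minimal sketch, assuming a Linux node with /proc mounted; the helper names parse_stat_line and find_d_state are made up for illustration and are not part of Slurm. It reads the state field from /proc/<pid>/stat and lists the processes currently in D state.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D)."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the stat file was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

If a slurmstepd process or one of the job's own processes keeps showing up in this list while the job sits in CG, that would point toward stalled file system I/O rather than a Slurm problem. A process that is only briefly in D state is normal, so it is worth running the check a few times.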


On 10/02/2025 09:28, Ricardo Román-Brenes via slurm-users wrote:

Hello everyone.
I have a cluster of 16 nodes; 4 of them have GPUs, with no particular configuration to manage them.
The file system is Gluster; authentication is via slapd/munge.
My problem is that very frequently, at least once a day, a job gets stuck in CG. I have no idea why this happens. Manually killing the slurmstepd process releases the node, but this is in no way a manageable solution. Has anyone experienced this (and fixed it)?
Thank you.

-Ricardo

--
best regards
*Michał Kadlof*

Head of the high performance computing center

Eden^N cluster administrator

Structural and Functional Genomics Laboratory
Faculty of Mathematics and Computer Science

Warsaw University of Technology


-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
