On 10/02/2025 09:28, Ricardo Román-Brenes via slurm-users wrote:
Hello everyone.

I have a cluster composed of 16 nodes, 4 of which have GPUs, with no particular configuration to manage them. The filesystem is Gluster, and authentication is via slapd/munge.

My problem is that very frequently, at least once a day, a job gets stuck in the CG (completing) state. I have no idea why this happens. Manually killing the slurmstepd process releases the node, but this is not a manageable solution. Has anyone experienced this (and fixed it)?

Thank you.
-Ricardo
--
Best regards,
Michał Kadlof
Head of the high performance computing center
Eden^N cluster administrator
Structural and Functional Genomics Laboratory
Faculty of Mathematics and Computer Science
Warsaw University of Technology