ps -eaf --forest is your friend with Slurm

On Mon, Feb 10, 2025, 12:08 PM Michał Kadlof via slurm-users <slurm-users@lists.schedmd.com> wrote:
> I observed similar symptoms when we had issues with the shared Lustre file
> system. When the file system couldn't complete an I/O operation, the
> process in Slurm remained in the CG state until the file system became
> responsive again. An additional symptom was that the blocking process was
> stuck in the D state.
>
> On 10/02/2025 09:28, Ricardo Román-Brenes via slurm-users wrote:
>
> > Hello everyone.
> >
> > I have a cluster composed of 16 nodes, with 4 of them having GPUs with no
> > particular configuration to manage them.
> > The filesystem is gluster, authentication via slapd/munge.
> >
> > My problem is that very frequently, let's say at least a job daily, gets
> > stuck in CG. I have no idea why this happens. Manually killing the
> > slurmstep process releases the node but this is in no way a manageable
> > solution. Has anyone experienced this (and fixed it?)
> >
> > Thank you.
> >
> > -Ricardo
>
> --
> best regards | pozdrawiam serdecznie
> *Michał Kadlof*
> Head of the high performance computing center
> Kierownik ośrodka obliczeniowego HPC
> EdenN cluster administrator
> Administrator klastra obliczeniowego EdenN
> Structural and Functional Genomics Laboratory
> Laboratorium Genomiki Strukturalnej i Funkcjonalnej
> Faculty of Mathematics and Computer Science
> Wydział Matematyki i Nauk Informacyjnych
> Warsaw University of Technology
> Politechnika Warszawska
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
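
The two hints in this thread (look at the process tree, and check for processes in the D state) can be combined into a quick check on the node that still holds the CG job. What follows is only a rough sketch: it assumes a Linux node with procps-ng ps, and the helper names list_d_state, show_stuck_procs and show_tree are made up for illustration, not part of Slurm.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and procps-ng ps (Linux) is assumed.

# Keep only lines whose third field (STAT) starts with D, i.e. processes in
# uninterruptible sleep, which are typically blocked on I/O.
list_d_state() {
    awk '$3 ~ /^D/'
}

# Print every D-state process with its parent PID and kernel wait channel.
# A WCHAN pointing into the filesystem client hints at stalled storage.
show_stuck_procs() {
    ps -eo pid,ppid,stat,wchan:32,args --no-headers | list_d_state
}

# Show the full process tree, so a stuck process can be traced back to the
# Slurm step daemon that spawned it.
show_tree() {
    ps -eaf --forest
}
```

Running show_stuck_procs on the affected node and then show_tree should show whether the job's own processes are the ones blocked. If they are, signals generally are not delivered until the pending I/O call returns, which would match the behaviour described above where the job only left CG once the file system responded again.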