Belay that reply. Different issue. In that case salloc works OK, but srun says the user has no job on the node.
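If it helps to narrow that down, a quick cross-check (the job id below is a placeholder):

    salloc -N1 --time=10           # grab a small allocation
    squeue -u $USER                # note the job id and the node that was granted
    scontrol show job <jobid>      # verify the NodeList the controller assigned
    srun --jobid=<jobid> hostname  # try to launch a step into that allocation

If srun only fails when targeting that allocation, it suggests the node's slurmd does not know about a job that slurmctld thinks is there; just a guess, not something confirmed in this thread.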
On Mon, Feb 10, 2025, 9:24 AM John Hearns <hear...@gmail.com> wrote:

> I have had something similar.
> The fix was to run a
> scontrol reconfig
> which causes a reread of the slurmd config.
> Give that a try.
>
> It might be scontrol reread. Use the manual.
>
> On Mon, Feb 10, 2025, 8:32 AM Ricardo Román-Brenes via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> Hello everyone.
>>
>> I have a cluster composed of 16 nodes, 4 of them with GPUs and no
>> particular configuration to manage them.
>> The filesystem is Gluster; authentication is via slapd/munge.
>>
>> My problem is that very frequently, at least one job daily, a job gets
>> stuck in CG. I have no idea why this happens. Manually killing the
>> slurmstepd process releases the node, but this is in no way a manageable
>> solution. Has anyone experienced this (and fixed it)?
>>
>> Thank you.
>>
>> -Ricardo
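For the stuck-CG problem itself, a rough checklist; this is only a sketch, <node> and the job id are placeholders, and it assumes the jobs are hanging on an unkillable slurmstepd, which the thread does not confirm:

    squeue -t COMPLETING                    # list jobs stuck in CG and the nodes involved
    scontrol show node <node>               # check the node state and any Reason field
    # on the affected node:
    pgrep -fl slurmstepd                    # find the leftover step daemon for the job
    # last resort, after killing the stepd, cycle the node to clear the CG state:
    scontrol update nodename=<node> state=down reason="stuck in CG"
    scontrol update nodename=<node> state=resume

If this happens daily, UnkillableStepTimeout (and possibly UnkillableStepProgram) in slurm.conf may be worth reviewing as well; again, a suggestion rather than a confirmed fix.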
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-le...@lists.schedmd.com