Generally, the troubleshooting steps to take for Slurm are:

squeue to list the running, queued, or held jobs

sinfo to show which nodes are idle, busy, or down

scontrol show node <nodename> to get more detailed information on a node

For problem nodes, log into the node and run the following. It is also
worth running them on any working node to see what a healthy node looks like:
systemctl status slurmd
cat /var/log/slurm/slurmd.log

On your Slurm controller, look at the slurmctld and slurmdbd logs.
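
The steps above could be collected into a couple of shell functions, roughly
like this. This is only a sketch: the run, triage_cluster and triage_node
names are made up for illustration, the node name is whatever you pass in,
and the slurmd log path is the one quoted above, which may differ on your
site (it is set by SlurmdLogFile in slurm.conf).

```shell
# Run a command only if it is available, so the sketch degrades gracefully
# on hosts without the Slurm client tools installed.
run() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 not found on this host"
    fi
}

# Cluster-wide view: jobs with their state and reason, then node states.
triage_cluster() {
    run squeue -o "%i %T %r"   # job ID, state, reason
    run sinfo -N -l            # one line per node, long format
    run sinfo -R               # reasons recorded for down or drained nodes
}

# Per-node view. $1 is the node name (hypothetical, supply your own).
# The last two commands are meant to be run on the node itself.
triage_node() {
    run scontrol show node "$1"
    run systemctl status slurmd
    run tail -n 50 /var/log/slurm/slurmd.log   # path assumed from above
}
```

For example, after sourcing the file, run triage_cluster on a login node, then
triage_node <nodename> on the node that the held job was assigned to.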




On Tue, 7 Jan 2025 at 11:30, sportlecon sportlecon via slurm-users <
slurm-users@lists.schedmd.com> wrote:

> slurm 24.11 - squeue displays  reason "launch failed requeued held"
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>