Generally, the troubleshooting steps you should take for Slurm are:

squeue
    to look at the list of running, queued, or held jobs
sinfo
    to show which nodes are idle, busy, or down
scontrol show node
    to get more detailed information on a node

For problem nodes, log in to the node (or indeed to any node, to see what a healthy node looks like) and run:

systemctl status slurmd
cat /var/log/slurm/slurmd.log

On your Slurm controller, look at the slurmctld and slurmdbd logs.

On Tue, 7 Jan 2025 at 11:30, sportlecon sportlecon via slurm-users <slurm-users@lists.schedmd.com> wrote:

> slurm 24.11 - squeue displays reason "launch failed requeued held"
>
> --
> slurm-users mailing list -- slurm-users@lists.schedmd.com
> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
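
As a rough sketch, the first three checks above could be wrapped in a small shell function so they are run in the same order each time. The names run_step and slurm_triage are hypothetical helpers, not part of Slurm, and node01 in the usage note is a placeholder node name. The sketch assumes only that squeue, sinfo, and scontrol are on the PATH of the host where it runs; where they are missing, the step is skipped rather than failing.

```shell
#!/bin/sh
# Hypothetical helper: print a command, then run it only if its binary exists.
run_step() {
    printf '== %s\n' "$*"
    if command -v "$1" >/dev/null 2>&1; then
        # Report a non-zero exit status instead of aborting the whole triage.
        "$@" || printf '(command exited with status %s)\n' "$?"
    else
        printf '(skipped: %s not found on this host)\n' "$1"
    fi
}

# Hypothetical helper: run the basic checks, optionally for one named node.
slurm_triage() {
    node="${1:-}"
    run_step squeue              # running, queued, or held jobs
    run_step sinfo               # which nodes are idle, busy, or down
    if [ -n "$node" ]; then
        run_step scontrol show node "$node"   # details for one node
    else
        run_step scontrol show node           # details for all nodes
    fi
}
```

After sourcing the file, `slurm_triage node01` runs the checks for a single node, and `slurm_triage` with no argument reports on all nodes. The slurmd status and log checks are left out because they have to be run on the compute node itself.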