Generally, the troubleshooting steps to take for Slurm are:
  squeue - look at the list of running, queued, or held jobs
  sinfo - show which nodes are idle, busy, or down
  scontrol show node - get more detailed information on a node
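
As a rough sketch, those three steps could be wrapped in one small shell function. The function name slurm_triage and the node name passed to it are made up for illustration. The extra options (--long, -N -l, -R) are only one reasonable choice of output format, not a requirement.

```shell
# First-pass Slurm triage: job queue, node states, then details for one node.
# slurm_triage is a hypothetical helper name; the node argument is optional.
slurm_triage() {
    node="$1"
    # Running, queued, and held jobs in long format (includes state and reason)
    squeue --long
    # One line per node, long format, showing idle/allocated/down states
    sinfo -N -l
    # Only nodes with a recorded reason (for example down or drained)
    sinfo -R
    # Detailed record for a single node, if a node name was given
    if [ -n "$node" ]; then
        scontrol show node "$node"
    fi
}
```

Calling `slurm_triage node01` (node01 being a placeholder) prints the queue, the node states, and then the full record for that node, including its State and Reason fields.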
For problem nodes - indeed just log into any node t
You need to find the node that the job started on.
Then look at the slurmd log on that node. It may give an indication of the
reason for the failure.
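
A minimal sketch of how that lookup might be done, assuming the job is still known to slurmctld (otherwise sacct would be needed, which requires accounting to be enabled). The helper names job_batch_host and slurmd_log_path are hypothetical. The log path is read from the running configuration rather than guessed, and it may come back as (null) if slurmd logs to syslog, or contain a per-node substitution such as %n.

```shell
# Print the node where the job's batch script was started.
# Relies on the BatchHost= field in "scontrol show job" output, which is
# only present once the job has started and while slurmctld still remembers it.
job_batch_host() {
    scontrol show job "$1" | sed -n 's/.*BatchHost=\([^ ]*\).*/\1/p'
}

# Print the configured slurmd log file path (the SlurmdLogFile setting).
slurmd_log_path() {
    scontrol show config | sed -n 's/^SlurmdLogFile *= *//p'
}
```

With these, one would log into the host printed by `job_batch_host <jobid>` and search the file printed by `slurmd_log_path` for the job ID, for example with grep. Reading that log usually needs root on the compute node.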
On Tue, 7 Jan 2025 at 11:30, sportlecon sportlecon via slurm-users <
slurm-users@lists.schedmd.com> wrote:
> slurm 24.11 - squeue displays reaso