You will want to look at the output of 'sinfo' and 'scontrol show node' to see what slurmctld thinks about your compute nodes; then, on the compute nodes, check the status of the slurmd service ('systemctl status -l slurmd') and possibly read through the slurmd logs as well.
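In a one-node setup like yours, this is most often the node itself being left in a DOWN or DRAINED state after the hard reboot (frequently with a reason like "Node unexpectedly rebooted"), which keeps every job pending on (Resources). A minimal check-and-recover sequence might look like the following; note the node name is an assumption taken from your shell prompt and may not match the NodeName that sinfo reports, so check that first:

    # What does the controller think of the node?
    sinfo
    scontrol show node

    # On the compute node: is slurmd actually running?
    systemctl status -l slurmd
    journalctl -u slurmd    # or the file named by SlurmdLogFile in slurm.conf

    # If the node shows State=DOWN or DRAIN with a reboot-related reason,
    # return it to service (use the name sinfo reports, not the hostname):
    scontrol update NodeName=ip-172-31-80-232 State=RESUME

If the node ends up DOWN after every unexpected reboot, the ReturnToService parameter in slurm.conf controls whether slurmctld returns it to service automatically once slurmd re-registers.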
On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.h...@gmail.com> wrote:

> Hello;
>
> I am in the process of familiarizing myself with slurm - I will write a
> piece of software which will submit jobs to a slurm cluster. Right now I
> have just made my own "cluster" consisting of one Amazon AWS node and I
> use that to familiarize myself with the sxxx commands - it has worked
> nicely.
>
> Now I just brought this AWS node completely to its knees (not slurm
> related) and had to stop and start the node from the AWS console - during
> that process a job managed by slurm was killed hard. Now that the node is
> back up again, slurm refuses to start jobs - the queue looks like this:
>
> ubuntu@ip-172-31-80-232:~$ squeue
>    JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>      186     debug tmp-file www-data PD  0:00     1 (Resources)
>      187     debug tmp-file www-data PD  0:00     1 (Resources)
>      188     debug tmp-file www-data PD  0:00     1 (Resources)
>      189     debug tmp-file www-data PD  0:00     1 (Resources)
>
> I.e. the jobs are pending for "Resources" reasons, but no jobs are
> running. I have tried to scancel all jobs, but when I add new jobs they
> again just stay pending. It should be said that when the node/slurm came
> back up again, the offending job which initially created the havoc was
> still in "Running" state, but the filesystem of that job had been
> completely wiped, so it was not in a sane state. scancel of this job
> worked fine - but no new jobs will start. It seems like a "ghost job" is
> blocking the other jobs from starting. I even tried to reinstall slurm
> using the package manager, but the new slurm installation would still not
> start jobs. Any tips on how I can proceed to debug this?
>
> Regards
>
> Joakim