You will want to look at the output of 'sinfo' and 'scontrol show node' to see what slurmctld thinks about your compute nodes; then, on the compute nodes, check the status of the slurmd service ('systemctl status -l slurmd') and possibly read through the slurmd logs as well.
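In a one-node setup like yours, this is most often the node itself being left in a DOWN or DRAINED state after the hard reboot (frequently with a reason like "Node unexpectedly rebooted"), which keeps every job pending on (Resources). A minimal check-and-recover sequence might look like the following; note the node name is an assumption taken from your shell prompt and may not match the NodeName that sinfo reports, so check that first:

    # What does the controller think of the node?
    sinfo
    scontrol show node

    # On the compute node: is slurmd actually running?
    systemctl status -l slurmd
    journalctl -u slurmd    # or the file named by SlurmdLogFile in slurm.conf

    # If the node shows State=DOWN or DRAIN with a reboot-related reason,
    # return it to service (use the name sinfo reports, not the hostname):
    scontrol update NodeName=ip-172-31-80-232 State=RESUME

If the node ends up DOWN after every unexpected reboot, the ReturnToService parameter in slurm.conf controls whether slurmctld returns it to service automatically once slurmd re-registers.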
On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.h...@gmail.com> wrote:

> Hello;
>
> I am in the process of familiarizing myself with slurm - I will write a
> piece of software which will submit jobs to a slurm cluster. Right now I
> have just made my own "cluster" consisting of one Amazon AWS node and I
> use that to familiarize myself with the sxxx commands - it has worked
> nicely.
>
> Now I just brought this AWS node completely to its knees (not slurm
> related) and had to stop and start the node from the AWS console - during
> that process a job managed by slurm was killed hard. Now that the node is
> back up again, slurm refuses to start jobs - the queue looks like this:
>
> ubuntu@ip-172-31-80-232:~$ squeue
>    JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
>      186     debug tmp-file www-data PD  0:00     1 (Resources)
>      187     debug tmp-file www-data PD  0:00     1 (Resources)
>      188     debug tmp-file www-data PD  0:00     1 (Resources)
>      189     debug tmp-file www-data PD  0:00     1 (Resources)
>
> I.e. the jobs are pending for "Resources" reasons, but no jobs are
> running. I have tried to scancel all jobs, but when I add new jobs they
> again just stay pending. It should be said that when the node/slurm came
> back up again, the offending job which initially created the havoc was
> still in "Running" state, but the filesystem of that job had been
> completely wiped, so it was not in a sane state. scancel of this job
> worked fine - but no new jobs will start. It seems like a "ghost job" is
> blocking the other jobs from starting. I even tried to reinstall slurm
> using the package manager, but the new slurm installation would still not
> start jobs. Any tips on how I can proceed to debug this?
>
> Regards
>
> Joakim