ubuntu@ip-172-31-80-232:/var/run/slurm-llnl$ scontrol show node
NodeName=ip-172-31-80-232 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=ip-172-31-80-232 NodeHostName=ip-172-31-80-232 Version=17.11
   OS=Linux 5.3.0-1017-aws #18~18.04.1-Ubuntu SMP Wed Apr 8 15:12:16 UTC 2020
   RealMemory=983 AllocMem=0 FreeMem=355 Sockets=1 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=debug
   BootTime=2020-05-11T17:02:15 SlurmdStartTime=2020-05-11T17:02:27
   CfgTRES=cpu=1,mem=983M,billing=1
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2020-05-11T16:20:02]

The "State=IDLE+DRAIN" looks a bit suspicious?
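For what it's worth, the Reason field already points at the likely cause: after the reboot, slurmd reported less memory than the RealMemory=983 configured for this node, so slurmctld drained it, and a drained node accepts no new jobs - which would explain every job sitting pending on (Resources). A plausible recovery, assuming the configured memory is first brought in line with what the node actually has, looks like this:

    # print the hardware slurmd actually detects (CPUs, RealMemory, ...)
    slurmd -C

    # after lowering RealMemory in slurm.conf to match, clear the DRAIN flag
    sudo scontrol update NodeName=ip-172-31-80-232 State=RESUME

(scontrol update ... State=RESUME is the standard way to undrain a node; the NodeName is simply the one from the output above.)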
The "State=IDLE+DRAIN" looks a bit suspicious? On Mon, May 11, 2020 at 7:16 PM Alex Chekholko <a...@calicolabs.com> wrote: > You will want to look at the output of 'sinfo' and 'scontrol show node' to > see what slurmctld thinks about your compute nodes; then on the compute > nodes you will want to check the status of the slurmd service ('systemctl > status -l slurmd') and possibly read through the slurmd logs as well. > > On Mon, May 11, 2020 at 10:11 AM Joakim Hove <joakim.h...@gmail.com> > wrote: > >> Hello; >> >> I am in the process of familiarizing myself with slurm - I will write a >> piece of software which will submit jobs to a slurm cluster. Right now I >> have just made my own "cluster" consisting of one Amazon AWS node and use >> that to familiarize myself with the sxxx commands - has worked nicely. >> >> Now I just brought this AWS node completely to it's knees (not slurm >> related) and had to stop and start the node from the AWS console - during >> that process a job managed by slurm was killed hard. Now when the node is >> back up again slurm refuses to start up jobs - the queue looks like this: >> >> ubuntu@ip-172-31-80-232:~$ squeue >> JOBID PARTITION NAME USER ST TIME NODES >> NODELIST(REASON) >> 186 debug tmp-file www-data PD 0:00 1 >> (Resources) >> 187 debug tmp-file www-data PD 0:00 1 >> (Resources) >> 188 debug tmp-file www-data PD 0:00 1 >> (Resources) >> 189 debug tmp-file www-data PD 0:00 1 >> (Resources) >> >> I.e. the jobs are pending due to Resource reasons, but no jobs are >> running? I have tried scancel all jobs, but when I add new jobs they again >> just stay pending. It should be said that when the node/slurm came back up >> again the offending job which initially created the havoc was still in >> "Running" state, but the filesystem of that job had been completely wiped >> so it was not in a sane state. scancel of this job worked fine - but no new >> jobs will start. Seems like there is "ghost job" blocking the other jobs >> from starting? I even tried to reinstall slurm using the package manager, >> but the new slurm installation would still not start jobs. Any tips on how >> I can proceed to debug this? >> >> Regards >> >> Joakim >> >