Hi Nathan,

The command I use to get the reason for failed nodes is 'sinfo -Ral'. If you need to extend the width of the output, use 'sinfo -Ral -O reason:35,user,timestamp,statelong,nodelist'.
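As a sketch, one way to pull just the timestamp and nodelist out of that output for log-grepping (the sample output below is hypothetical; the real columns depend on your Slurm version and the -O format you pass):

```shell
# Hypothetical sample of 'sinfo -Ral'-style output; a real cluster's
# output will differ. The reason field may contain spaces, so count
# fields from the right rather than the left.
sample='REASON               USER      TIMESTAMP           STATE  NODELIST
Not responding       slurm     2019-06-17T18:02:11 down*  node-42'

# Skip the header, then print the timestamp (third field from the end)
# and the nodelist (last field) for each failed node.
failed_info=$(printf '%s\n' "$sample" | awk 'NR > 1 { print $(NF-2), $NF }')
printf '%s\n' "$failed_info"
```

On a live system you would pipe 'sinfo -Ral' straight into the awk instead of the here-string, then search the slurmd/slurmctld logs around each printed timestamp.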
Using the timestamp of the failure, look in the slurmd or slurmctld logs.

---
Sam Gallop

-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of nathan norton
Sent: 18 June 2019 09:33
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] status of cloud nodes

Hi all,

I am using Slurm with a cloud provider and it is all working a treat. Let's say I have 100 nodes, all working fine and able to be scheduled; everything works:

$ srun -N100 hostname

For some unknown reason, after the machines shut down (for example over the weekend, when no jobs get scheduled for an hour), the next time a job runs,

$ srun -N90 hostname

fails with:

srun: Required node not available (down, drained or reserved)
srun: job JOBID queued and waiting for resources

This is weird, as no other jobs are running and I should be able to start up the nodes as requested. Being 'cloud' type nodes, if I run

$ scontrol show node

only the up and working nodes are displayed, not the failed nodes. How do I get the failed nodes' information?

If I stop all nodes and run the commands below, I can then start up all the nodes again:

scontrol update NodeName=node-1-100 State=DOWN Reason="undraining"
scontrol update NodeName=node-1-100 State=RESUME
scontrol show node

So that fixes it, but I want to figure out why nodes get into this state, and how I can monitor it. Is there a command to get the status of CLOUD nodes?

Any help appreciated.

Thanks,
Nathan.
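The recovery sequence from the thread can be wrapped in a small dry-run helper, so the scontrol commands are printed for review first and only executed when explicitly requested (the function name and the APPLY switch are my own; the scontrol invocations are taken verbatim from the message above):

```shell
# Dry-run sketch of the recovery sequence described in the thread.
# By default the scontrol commands are only printed; set APPLY=1 to
# actually execute them against the cluster.
recover_nodes() {
    nodelist=$1
    run() {
        if [ "${APPLY:-0}" = "1" ]; then
            "$@"          # really run scontrol
        else
            echo "$@"     # dry run: show what would be executed
        fi
    }
    # Mark the nodes DOWN with a reason, then let Slurm resume them.
    run scontrol update NodeName="$nodelist" State=DOWN Reason=undraining
    run scontrol update NodeName="$nodelist" State=RESUME
}

recover_nodes "node-1-100"
```

Printing first is deliberate: forcing cloud nodes through DOWN/RESUME is disruptive, so it is worth eyeballing the exact node range before applying it.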