Hi Nathan,

The command I use to get the reason for failed nodes is 'sinfo -Ral'. If you need to extend the width of the output, use 'sinfo -Ral -O reason:35,user,timestamp,statelong,nodelist'.
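As a sketch, one way to pull just the timestamp and nodelist out of that output for log-grepping (the sample output below is hypothetical; the real columns depend on your Slurm version and the -O format you pass):

```shell
# Hypothetical sample of 'sinfo -Ral'-style output; a real cluster's
# output will differ. The reason field may contain spaces, so count
# fields from the right rather than the left.
sample='REASON               USER      TIMESTAMP           STATE  NODELIST
Not responding       slurm     2019-06-17T18:02:11 down*  node-42'

# Skip the header, then print the timestamp (third field from the end)
# and the nodelist (last field) for each failed node.
failed_info=$(printf '%s\n' "$sample" | awk 'NR > 1 { print $(NF-2), $NF }')
printf '%s\n' "$failed_info"
```

On a live system you would pipe 'sinfo -Ral' straight into the awk instead of the here-string, then search the slurmd/slurmctld logs around each printed timestamp.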
Using the timestamp of the failure, look in the slurmd or slurmctld logs.

---
Sam Gallop

-----Original Message-----
From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of nathan norton
Sent: 18 June 2019 09:33
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] status of cloud nodes

Hi all,

I am using Slurm with a cloud provider and it is all working a treat. Let's say I have 100 nodes, all working fine and able to be scheduled; everything works:

$ srun -N100 hostname

For some unknown reason, after the machines shut down (for example over the weekend, when no jobs get scheduled for an hour), the next time a job runs,

$ srun -N90 hostname

fails with:

srun: Required node not available (down, drained or reserved)
srun: job JOBID queued and waiting for resources

This is weird, as no other jobs are running and I should be able to start up the nodes as requested. Being 'cloud' type nodes, if I run

$ scontrol show node

only the up and working nodes are displayed, not the failed nodes. How do I get the failed nodes' information?

If I stop all nodes and run the commands below, I can then start up all the nodes again:

scontrol update NodeName=node-1-100 State=DOWN Reason="undraining"
scontrol update NodeName=node-1-100 State=RESUME
scontrol show node

So that fixes it, but I want to figure out why nodes get into this state, and how I can monitor it. Is there a command to get the status of CLOUD nodes?

Any help appreciated.

Thanks,
Nathan.
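The recovery sequence from the thread can be wrapped in a small dry-run helper, so the scontrol commands are printed for review first and only executed when explicitly requested (the function name and the APPLY switch are my own; the scontrol invocations are taken verbatim from the message above):

```shell
# Dry-run sketch of the recovery sequence described in the thread.
# By default the scontrol commands are only printed; set APPLY=1 to
# actually execute them against the cluster.
recover_nodes() {
    nodelist=$1
    run() {
        if [ "${APPLY:-0}" = "1" ]; then
            "$@"          # really run scontrol
        else
            echo "$@"     # dry run: show what would be executed
        fi
    }
    # Mark the nodes DOWN with a reason, then let Slurm resume them.
    run scontrol update NodeName="$nodelist" State=DOWN Reason=undraining
    run scontrol update NodeName="$nodelist" State=RESUME
}

recover_nodes "node-1-100"
```

Printing first is deliberate: forcing cloud nodes through DOWN/RESUME is disruptive, so it is worth eyeballing the exact node range before applying it.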