Hi Xaver,
On 11/14/24 12:59, Xaver Stiensmeier via slurm-users wrote:
> I would like to startup all ~idle (idle and powered down) nodes and check
> programmatically if all came up as expected. For context: this is for a
> program that sets up slurm clusters with on demand cloud scheduling.
> In the most easiest fashion this could be executing a command like *srun
> FORALL hostname* which would return the names of the nodes if it succeeds
> and an error message otherwise. However, there's no such input value like
> FORALL as far as I am aware. One could use -N{total node number} as all
> nodes are ~idle when this executes, but I don't know an easy way to get
> the total number of nodes.
Good documentation exists on this topic, and I recommend starting with
the Slurm Power Saving Guide (https://slurm.schedmd.com/power_save.html).
Once you have developed a method to power up your cloud nodes, the
slurmd daemons will register with slurmctld when they start. Simply running
the "sinfo" command will tell you which nodes are up (idle) and which are
still in a powered-down state (idle~).
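As a rough sketch of how that check could be scripted: the two helper
functions below (powered_down_nodes and total_nodes are names made up for
illustration) read "NODENAME STATE" lines such as those printed by
sinfo -N -h -o "%N %T", and assume that powered-down nodes carry the "~"
suffix in the state column. Treat it as a starting point, not a tested
recipe for your site.

```shell
#!/bin/sh
# Hypothetical helpers; both read "NODENAME STATE" lines on stdin,
# e.g. from:  sinfo -N -h -o "%N %T"

# Print each node whose state contains "~" (assumed: powered down).
# sort -u removes duplicates, since sinfo -N lists a node once per partition.
powered_down_nodes() {
    awk 'index($2, "~") > 0 { print $1 }' | sort -u
}

# Print the number of distinct node names in the input.
total_nodes() {
    awk '{ print $1 }' | sort -u | wc -l | tr -d ' '
}

# Example usage on a running cluster:
#   sinfo -N -h -o "%N %T" | powered_down_nodes
#   sinfo -N -h -o "%N %T" | total_nodes
```

An empty result from powered_down_nodes would then suggest that no node is
still powered down, and the number from total_nodes could serve as the node
count you were missing for -N.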
When slurmd starts up, it calls the HealthCheckProgram defined in
slurm.conf to verify that the node is healthy; configuring one is strongly
recommended. The slurmd won't start if HealthCheckProgram reports a failure,
and you'll need to check such nodes manually.
So there should not be any need to execute commands on the nodes.
If you wish, you can still run a command on all "idle" nodes, for example
using ClusterShell[1]:
$ clush -bw@slurmstate:idle uname -r
Best regards,
Ole
[1] The Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#clustershell
shows example usage of ClusterShell