Hi Ole,
thank you for your answer!
I apologize for the unclear wording. We have already implemented
on-demand scheduling.
However, we have not provided a HealthCheckProgram yet (which simply
means that the node starts without health check). I will look into it
regardless of my question.
Back to the question: I am aware that sinfo contains the information,
but I am looking for something like sinfo that produces more
machine-friendly output, because I want to verify programmatically that
all nodes started correctly. ClusterShell is also on my list of software
to try out in general.
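One possible way to get such output, sketched under assumptions: sinfo accepts a custom format, and `sinfo -N -h -o "%N %T"` prints one headerless `name state` pair per line, which is easy to parse. The helper names below (`parse_sinfo`, `node_states`, `not_ready`) are hypothetical, and treating only a plain `idle` state as "started correctly" is an assumption about what the check should accept.

```python
import subprocess

# Node-oriented, headerless output: one "<node> <state>" pair per line.
SINFO_CMD = ["sinfo", "-N", "-h", "-o", "%N %T"]


def parse_sinfo(text):
    """Parse 'name state' lines into a dict (hypothetical helper).

    A node that belongs to several partitions is listed once per
    partition; the dict keeps a single entry per node name.
    """
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            name, state = fields
            states[name] = state
    return states


def node_states():
    """Run sinfo and return {node_name: state}. Needs a Slurm client."""
    result = subprocess.run(
        SINFO_CMD, capture_output=True, text=True, check=True
    )
    return parse_sinfo(result.stdout)


def not_ready(states):
    """Return the nodes whose state is not exactly 'idle'.

    Assumption: a suffix such as '~' (powered down), '#' (powering up)
    or '*' (not responding) means the node is not usable yet.
    """
    return {name: state for name, state in states.items() if state != "idle"}
```

With this, `len(node_states())` would also give the total number of nodes, for example as the value for `srun -N`. Newer Slurm releases may additionally offer JSON output from sinfo; whether that is available depends on the installed version.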
*More Context*
We maintain a tool that creates Slurm clusters in OpenStack from
configuration files, and we would like to write integration tests for
it. In CI/CD, we want to test whether the Slurm cluster behaves as
expected for a given configuration of our program. Of course, this
includes checking whether the nodes power up.
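Since cloud nodes take a while to boot, an integration test would probably need to poll with a timeout instead of checking once. Below is a minimal sketch of such a loop; `wait_until_idle`, its parameters, and the default timeout and interval are hypothetical. `get_states` is any callable that returns a dict mapping node names to state strings (for instance, parsed from sinfo output). The clock and sleep functions are injectable so that the loop itself can be tested without waiting.

```python
import time


def wait_until_idle(get_states, timeout=600.0, interval=10.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until every node reports 'idle' or the timeout expires.

    Returns an empty dict on success, otherwise the {node: state}
    entries that were still not 'idle' at the last poll.
    """
    deadline = clock() + timeout
    while True:
        # Collect every node that is not (yet) in the plain idle state.
        pending = {n: s for n, s in get_states().items() if s != "idle"}
        if not pending:
            return {}
        if clock() >= deadline:
            return pending
        sleep(interval)
```

In a test, one would assert that the return value is empty and print the remaining entries otherwise. Note that an empty state dict counts as success here, so a test may also want to check the expected node count.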
Best regards,
Xaver
On 14/11/2024 at 14:57, Ole Holm Nielsen via slurm-users wrote:
Hi Xaver,
On 11/14/24 12:59, Xaver Stiensmeier via slurm-users wrote:
I would like to start up all idle~ (idle and powered-down) nodes and
check programmatically whether all of them came up as expected. For
context: this is for a program that sets up Slurm clusters with
on-demand cloud scheduling.
In the simplest case, this could mean executing a command like
*srun FORALL hostname*, which would return the names of the nodes if
it succeeds and an error message otherwise. However, there is no such
input value as FORALL as far as I am aware. One could use -N{total
node number}, since all nodes are idle~ when this executes, but I don't
know an easy way to get the total number of nodes.
Good documentation exists on this topic, and I recommend starting
with the Slurm Power Saving Guide
(https://slurm.schedmd.com/power_save.html)
When you have developed a method to power up your cloud nodes, the
slurmd daemons will register with slurmctld when they are started.
Simply using the "sinfo" command will tell you which nodes are up (idle)
and which are still in a powered-down state (idle~).
When slurmd starts up, it calls the HealthCheckProgram defined in
slurm.conf to verify that the node is healthy - strongly recommended.
The slurmd won't start if HealthCheckProgram gives a faulty status,
and you'll need to check such nodes manually.
So there should not be any need to execute commands on the nodes.
If you wish, you can still run a command on all "idle" nodes, for
example using ClusterShell[1]:
$ clush -bw@slurmstate:idle uname -r
Best regards,
Ole
[1] The Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#clustershell
shows example usage of ClusterShell