My approach would be something along these lines:

sinfo -t POWERED_DOWN -o %n -h | parallel -i -j 10 --timeout 900 srun -w {} hostname

sinfo lists all powered-down nodes and the output gets piped into parallel. parallel will then run 10 (or however many you want) srun instances simultaneously, with a timeout of 900 seconds to give the hosts enough time to power up. If everything works, parallel exits with 0; otherwise its exit status is the number of failed jobs.

Works like a charm for me; the only downside is that parallel usually needs to be installed first. But it's useful for other cases as well.

Regards,
Gerald Schneider

--
Gerald Schneider
Technical Staff Member
IT-SC

Fraunhofer-Institut für Graphische Datenverarbeitung IGD
Joachim-Jungius-Str. 11 | 18059 Rostock | Germany
Phone +49 6151 155-309 | +49 381 4024-193 | Fax +49 381 4024-199
gerald.schnei...@igd-r.fraunhofer.de | www.igd.fraunhofer.de

From: Xaver Stiensmeier via slurm-users <slurm-users@lists.schedmd.com>
Sent: Friday, 15 November 2024 14:03
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: How to power up all ~idle nodes and verify that they 
have started up without issue programmatically


Maybe to expand on this even further:

I would like to run something that waits and returns 0 once all workers have been powered up (the resume script ran without an issue), and returns non-zero (for example 1) otherwise. Then I could start other routines to complete the integration test.

And my personal idea was to use something like:

scontrol show nodes | awk '/NodeName=/ {print $1}' | sed 's/NodeName=//' | sort -u | xargs -Inodenames srun -w nodenames hostname

to execute the hostname command on all instances, which forces them to power up. However, that feels a bit clunky, and the output is definitely not perfect as it needs parsing.
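
Something like the following rough sketch might do as the waiting part (untested, and it assumes a Slurm version whose sinfo accepts the POWERING_UP / DRAINED state filters):

# Wait until no node is still powering up ...
while [ -n "$(sinfo -h -t POWERING_UP -o %n)" ]; do
    sleep 10
done
# ... then fail if any node ended up down or drained instead of idle.
if [ -n "$(sinfo -h -t DOWN,DRAINED -o %n)" ]; then
    exit 1
fi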

Best regards,
Xaver
On 11/14/24 14:36, Xaver Stiensmeier wrote:

Hi Ole,

thank you for your answer!

I apologize for the unclear wording. We have already implemented the on-demand scheduling. However, we have not provided a HealthCheckProgram yet (which simply means that the nodes start without a health check). I will look into it regardless of my question.

Back to the question: I am aware that sinfo contains the information, but I am basically looking for a method like sinfo that produces more machine-friendly output, as I want to verify the correct start of all nodes programmatically. ClusterShell is also on my list of software to try out in general.

More Context

We are maintaining a tool that creates Slurm clusters in OpenStack from configuration files, and we would like to write integration tests for it. That is, we would like to be able to test (in CI/CD) whether the Slurm cluster behaves as expected given certain configurations of our program. Of course this includes checking whether the nodes power up.

Best regards,
Xaver
On 14/11/2024 at 14:57, Ole Holm Nielsen via slurm-users wrote:
Hi Xaver,

On 11/14/24 12:59, Xaver Stiensmeier via slurm-users wrote:

I would like to start up all ~idle (idle and powered-down) nodes and check programmatically if all came up as expected. For context: this is for a program that sets up Slurm clusters with on-demand cloud scheduling.

In the simplest fashion this could be executing a command like *srun FORALL hostname*, which would return the names of the nodes if it succeeds and an error message otherwise. However, there's no such input value as FORALL as far as I am aware. One could use -N{total node number}, as all nodes are ~idle when this executes, but I don't know an easy way to get the total number of nodes.

There is good documentation around this, and I recommend starting with the Slurm Power Saving Guide (https://slurm.schedmd.com/power_save.html).

When you have developed a method to power up your cloud nodes, the slurmd daemons will register with slurmctld when they are started.  Simply using the "sinfo" command will tell you which nodes are up (idle) and which are still in a powered-down state (idle~).
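
If you want that in a machine-friendly form, sinfo can also list one node per line with its state, e.g.:

$ sinfo -h -N -o "%n %t"

which prints each hostname followed by its compact state ("idle" vs. "idle~" and so on), ready for scripting.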

When slurmd starts up, it calls the HealthCheckProgram defined in slurm.conf to verify that the node is healthy - strongly recommended.  The slurmd won't start if the HealthCheckProgram reports a failure, and you'll need to check such nodes manually.
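
For reference, the relevant slurm.conf settings look roughly like this (the script path is a placeholder; many sites plug in LBNL's Node Health Check here):

# slurm.conf: health check run by slurmd at startup and every 5 minutes.
HealthCheckProgram=/usr/local/sbin/node_health.sh
HealthCheckInterval=300
HealthCheckNodeState=ANY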

So there should not be any need to execute commands on the nodes.

If you wish, you can still run a command on all "idle" nodes, for example using ClusterShell[1]:

$ clush -bw@slurmstate:idle uname -r

Best regards,
Ole

[1] The Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#clustershell shows 
example usage of ClusterShell