That looks very promising. We will have to increase the timeout due to
an extensive Ansible setup run on each node before it is ready, but the
idea should work nonetheless. I will try it out.

Thank you!

Xaver

On 11/15/24 14:36, Schneider, Gerald wrote:

My approach would be something along this:

sinfo -t POWERED_DOWN -o %n -h | parallel -i -j 10 --timeout 900 srun
-w {} hostname

sinfo lists all powered-down nodes and the output gets piped into
parallel. parallel will then run 10 (or however many you want) srun
instances simultaneously, with a timeout of 900 seconds to give the
hosts enough time to power up. If everything works, parallel exits
with 0; otherwise its exit status reflects the number of failed jobs.

Works like a charm for me; the only downside is that parallel usually
needs to be installed first. But it's useful for other cases as well.

Regards,

Gerald Schneider

--

Gerald Schneider

Technical staff member

IT-SC

Fraunhofer-Institut für Graphische Datenverarbeitung IGD

Joachim-Jungius-Str. 11 | 18059 Rostock | Germany

Phone +49 6151 155-309 | +49 381 4024-193 | Fax +49 381 4024-199

gerald.schnei...@igd-r.fraunhofer.de
<mailto:gerald.schnei...@igd-r.fraunhofer.de>| www.igd.fraunhofer.de
<http://www.igd.fraunhofer.de/>

*From:*Xaver Stiensmeier via slurm-users <slurm-users@lists.schedmd.com>
*Sent:* Friday, November 15, 2024 14:03
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] Re: How to power up all ~idle nodes and
verify that they have started up without issue programmatically

Maybe to expand on this even further:

I would like to run something that waits and returns 0 once all
workers have powered up (i.e. the resume script ran without an issue)
and returns a non-zero code (for example 1) otherwise. Then I could
start other routines to complete the integration test.
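The wait-and-return behaviour described above could be sketched as a small polling helper. This is only a sketch: `wait_all_up` is a hypothetical name, and on a real cluster the count command it polls would be something like `sinfo -h -t POWERED_DOWN -o '%n' | wc -l`.

```shell
# Hedged sketch: poll a "powered-down node count" command until it
# reports zero, returning 0 on success or 1 after max_tries attempts.
# wait_all_up is a hypothetical helper name.
wait_all_up() {
    max_tries=$1; shift
    tries=0
    while [ "$("$@")" -gt 0 ]; do
        tries=$((tries + 1))
        [ "$tries" -ge "$max_tries" ] && return 1
        sleep 1
    done
    return 0
}

# Stand-in count commands so the helper can be tried without a cluster;
# in real use:  wait_all_up 90 sh -c "sinfo -h -t POWERED_DOWN -o '%n' | wc -l"
wait_all_up 3 sh -c 'echo 0' && echo "all up"      # count already zero
wait_all_up 2 sh -c 'echo 5' || echo "timed out"   # count never drops
```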

And my personal idea was to use something like:

    scontrol show nodes | awk '/NodeName=/ {print $1}' | sed
    's/NodeName=//' | sort -u | xargs -Inodenames srun -w nodenames
    hostname

to execute the hostname command on all instances, which forces them to
power up. However, that feels a bit clunky, and the output is
definitely not perfect as it needs parsing.
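That pipeline could possibly be slimmed down: sinfo can print one bare node name per line itself, which removes the awk/sed step, and POSIX xargs exits non-zero (123) when any invoked command fails, which supplies the programmatic pass/fail signal. A sketch, where printf and echo stand in for sinfo and srun so the text processing can be tried without a cluster (the node names are made up):

```shell
# On a live cluster, the un-mocked pipeline would be:
#   sinfo -N -h -o '%n' | sort -u | xargs -I{} srun -w {} hostname
# Here printf replaces sinfo and echo replaces srun.
printf 'worker2\nworker1\nworker1\n' \
  | sort -u \
  | xargs -I{} echo srun -w {} hostname
# prints:
#   srun -w worker1 hostname
#   srun -w worker2 hostname
```

Checking `$?` after the pipeline then gives the exit-with-0-on-success behaviour described above.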

Best regards,
Xaver

On 11/14/24 14:36, Xaver Stiensmeier wrote:

    Hi Ole,

    thank you for your answer!

    I apologize for the unclear wording. We have already implemented
    the on-demand scheduling.
    However, we have not yet provided a HealthCheckProgram (which
    simply means that the node starts without a health check). I will
    look into it regardless of my question.

    Back to the question: I am aware that sinfo contains the
    information, but I am basically looking for something like sinfo
    that produces more machine-friendly output, as I want to verify
    the correct start of all nodes programmatically. ClusterShell is
    also on my list of software to try out in general.

    *More Context*

    We are maintaining a tool that creates Slurm clusters in OpenStack
    from configuration files, and we would like to write integration
    tests for it. That is, we would like to be able to test (in CI/CD)
    whether the Slurm cluster behaves as expected given certain
    configurations of our program. Of course this includes checking
    whether the nodes power up.

    Best regards,
    Xaver

    On 14/11/2024 at 14:57, Ole Holm Nielsen via slurm-users wrote:

        Hi Xaver,

        On 11/14/24 12:59, Xaver Stiensmeier via slurm-users wrote:

            I would like to start up all ~idle (idle and powered-down)
            nodes and check programmatically whether all came up as
            expected. For context: this is for a program that sets up
            Slurm clusters with on-demand cloud scheduling.

            In the simplest fashion this could be executing a
            command like *srun FORALL hostname*, which would return the
            names of the nodes if it succeeds and an error message
            otherwise. However, there's no such input value as FORALL
            as far as I am aware. One could use -N{total node
            number}, as all nodes are ~idle when this executes, but I
            don't know an easy way to get the total number of nodes.


        There is good documentation on this, and I recommend
        starting with the Slurm Power Saving Guide
        (https://slurm.schedmd.com/power_save.html).

        When you have developed a method to power up your cloud nodes,
        the slurmd daemons will register with slurmctld when they are
        started.  Simply using the "sinfo" command will tell you which
        nodes are up (idle) and which are still in a powered-down
        state (idle~).
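For the machine-friendly output asked about earlier in the thread, sinfo's format options may already be enough: -h drops the header, -N prints one line per node, and %n/%t select node name and short state (idle~ meaning powered down). A sketch, with printf standing in for sinfo so the parsing step can be tried without a cluster (node names are made up):

```shell
# On a live cluster, a parseable name/state listing would be:
#   sinfo -h -N -o '%n %t'
# printf mocks that output here.
printf 'worker1 idle~\nworker2 idle\n' \
  | awk '$2 == "idle~" { print $1 }'   # keeps only powered-down nodes
# prints: worker1
```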

        When slurmd starts up, it calls the HealthCheckProgram defined
        in slurm.conf to verify that the node is healthy - strongly
        recommended.  The slurmd won't start if the HealthCheckProgram
        reports a fault, and you'll need to check such nodes
        manually.
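A HealthCheckProgram is just a script that exits non-zero when the node is unhealthy. A minimal sketch, where the single check (root filesystem below 95% full) and the threshold are illustrative assumptions, not a recommended check set:

```shell
#!/bin/sh
# Hedged sketch of a minimal HealthCheckProgram: a non-zero exit
# flags the node as unhealthy. The single check below is illustrative.
health_check() {
    # 5th field of df -P's second line is the usage percentage, e.g. "42%".
    usage=$(df -P / | awk 'NR==2 { gsub(/%/, ""); print $5 }')
    if [ "$usage" -ge 95 ]; then
        echo "root filesystem ${usage}% full" >&2
        return 1
    fi
    return 0
}
if health_check; then
    echo "node healthy"
fi
```

In slurm.conf this would be wired up as, e.g., HealthCheckProgram=/usr/local/sbin/healthcheck.sh (the path is an assumption) together with HealthCheckInterval.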

        So there should not be any need to execute commands on the nodes.

        If you wish, you can still run a command on all "idle" nodes,
        for example using ClusterShell[1]:

        $ clush -bw@slurmstate:idle uname -r

        Best regards,
        Ole

        [1] The Wiki page
        https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#clustershell
        shows example usage of ClusterShell

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
