Hi everybody,

to force a run of your config management, as Tina suggested, you might just add an

ExecStartPre=

line to your slurmd.service file.

This is somewhat unrelated to your problem, but we are very successfully using

ExecStartPre=-/usr/bin/nvidia-smi -L

in our slurmd.service file to make sure that all GPU devices are visible and available on our GPU nodes *before* slurmd starts. Note that the dash after the "=" is important: it makes systemd ignore potential errors when running that command.
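
For illustration, a minimal sketch of a drop-in that combines both ideas - the ansible-pull command, the repository URL and the playbook name are only placeholders, substitute whatever your config management actually uses:

# /etc/systemd/system/slurmd.service.d/override.conf (hypothetical drop-in)
[Service]
# refresh the Slurm config before slurmd starts; the leading "-" tells
# systemd to ignore a non-zero exit status instead of failing the unit
ExecStartPre=-/usr/bin/ansible-pull -U https://git.example.org/site-config.git slurm-node.yml
# check that the GPU devices are initialised and visible before slurmd starts
ExecStartPre=-/usr/bin/nvidia-smi -L

Run "systemctl daemon-reload" afterwards so systemd picks up the drop-in.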

Hermann

On 2/24/22 10:42 AM, Tina Friedrich wrote:
Hi David,

it's also not actually a problem if the slurm.conf is not exactly the same immediately on boot - really. Unless there are changes that are very fundamental, nothing bad will happen if the nodes pick up a new copy after, say, 5 or 10 minutes.

But it should be possible to - for example - force a run of your config management on startup (or before SLURM startup)?

(Not many ideas about the Nagios check, unless you change it to something that interrogates SLURM about node states, or keep some other record somewhere that it can interrogate about nodes meant to be down.)
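
Something along these lines might work as the SLURM-interrogating variant - just a sketch, assuming a Nagios-style plugin convention and that your sinfo marks powered-down nodes with the usual "~" suffix:

#!/bin/bash
# hypothetical check_slurm_node plugin: treat a node that Slurm has
# powered down on purpose as OK rather than as a failure
node="$1"
state=$(sinfo -h -N -n "$node" -o '%T' | head -n1)
case "$state" in
  idle|mixed|allocated|completing)
    echo "OK - $node is up ($state)"; exit 0 ;;
  *~|*power*)
    echo "OK - $node is powered down by Slurm ($state)"; exit 0 ;;
  down*|drain*|fail*)
    echo "CRITICAL - $node is $state"; exit 2 ;;
  *)
    echo "UNKNOWN - $node reports state '$state'"; exit 3 ;;
esac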

Tina

On 24/02/2022 09:20, David Simpson wrote:
Hi Brian,

 >>For monitoring, I use a combination of netdata+prometheus. Data is gathered whenever the nodes are up and stored for history. Yes, when the nodes are powered down, there are empty gaps, but that is interpreted as the node being powered down.

Ah, a time-series approach will cope much better - at the moment our monitoring system (for compute node health at least) is Nagios-like, hence the problem. Though there's a chance the entire cluster's stack may change at some point, in which case this problem would become easier to deal with (with a change of monitoring system for node health).

 >>For the config, I have no access to DNS for configless mode, so I use a symlink to the slurm.conf file on a shared filesystem. This works great. Any time there are changes, a simple 'scontrol reconfigure' brings all running nodes up to speed, and any down nodes will automatically read the latest copy.

Yes, currently we use a file-based config, written to the compute nodes' local disks via Ansible. Perhaps we will consider moving the file to a shared filesystem.

regards
David

-------------

David Simpson - Senior Systems Engineer

ARCCA, Redwood Building,

King Edward VII Avenue,

Cardiff, CF10 3NB

David Simpson - peiriannydd uwch systemau

ARCCA, Adeilad Redwood,

King Edward VII Avenue,

Caerdydd, CF10 3NB

simpso...@cardiff.ac.uk

+44 29208 74657

*From:* slurm-users <slurm-users-boun...@lists.schedmd.com> *On Behalf Of* Brian Andrus
*Sent:* 23 February 2022 15:27
*To:* slurm-users@lists.schedmd.com
*Subject:* Re: [slurm-users] monitoring and update regime for Power Saving nodes

David,

For monitoring, I use a combination of netdata+prometheus. Data is gathered whenever the nodes are up and stored for history. Yes, when the nodes are powered down, there are empty gaps, but that is interpreted as the node being powered down.

For the config, I have no access to DNS for configless mode, so I use a symlink to the slurm.conf file on a shared filesystem. This works great. Any time there are changes, a simple 'scontrol reconfigure' brings all running nodes up to speed, and any down nodes will automatically read the latest copy.
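
Roughly like this - the actual paths are just placeholders:

# on each node: point the local config at the copy on the shared filesystem
ln -sf /shared/slurm/etc/slurm.conf /etc/slurm/slurm.conf

# after editing the shared copy: tell all running daemons to re-read it;
# powered-down nodes simply pick up the new file when they boot
scontrol reconfigure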

Brian Andrus

On 2/23/2022 2:31 AM, David Simpson wrote:

    Hi all,

    I'm interested to know what the common approaches are to:


     1. Monitoring of power-saving nodes (e.g. node health), when the
        monitoring system will potentially see them go up and down. Do
        you limit yourselves to BMC-only monitoring/health?
     2. When you want to make changes to slurm.conf (or anything else)
        on a node which is down due to power saving (during a
        maintenance/reservation), what is your approach? Do you end up
        with 2 slurm.confs (one for power saving and one that keeps
        everything up, to work on during the maintenance)?


    thanks
    David


    -------------

    David Simpson - Senior Systems Engineer

    ARCCA, Redwood Building,

    King Edward VII Avenue,

    Cardiff, CF10 3NB

    David Simpson - peiriannydd uwch systemau

    ARCCA, Adeilad Redwood,

    King Edward VII Avenue,

    Caerdydd, CF10 3NB

    simpso...@cardiff.ac.uk

    +44 29208 74657


