@Tina,
Figure slurmd reads the config once at startup and runs with it. You
would need to have it recheck regularly to see if there are any
changes. This is exactly what 'scontrol reconfig' does: it tells all
the Slurm daemons to reread the config.
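For example, after pushing out a new slurm.conf, from any host with
scontrol access:

    scontrol reconfig

and slurmctld plus all the slurmd daemons will reread their
configuration without a restart (though a handful of settings still
require a full daemon restart).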
@Steven,
It seems to me you could just have a monitor daemon that keeps things
up-to-date.
It could watch for the alert that AWS sends (the 2-minute warning,
IIRC) and take appropriate action: drain the node and
cancel/checkpoint the job.
In addition, it could keep an eye on things in the event a warning
wasn't received and a node 'vanishes'. I suspect Nagios even has the
hooks to make that work. You could also email the user to let them
know their job was ended because the spot instance was pulled.
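As a rough sketch of the spot-warning part (untested; assumes the node
can query instance metadata IMDSv1-style, that the hostname matches
the Slurm node name, and that scontrol/scancel are in PATH):

    import socket
    import subprocess
    import time

    import requests

    # Returns 404 until AWS schedules an interruption, then 200 with a
    # small JSON body. IMDSv2 would need a token request first.
    IMDS = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    node = socket.gethostname()
    while True:
        try:
            r = requests.get(IMDS, timeout=2)
        except requests.RequestException:
            time.sleep(5)
            continue
        if r.status_code == 200:
            # Drain so nothing new lands here, then cancel whatever is
            # still running; emailing the user could hook in here too.
            subprocess.run(["scontrol", "update", f"nodename={node}",
                            "state=drain", "reason=spot-reclaim"],
                           check=False)
            subprocess.run(["scancel", f"--nodelist={node}"],
                           check=False)
            break
        time.sleep(5)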
Just some ideas,
Brian Andrus
On 5/5/2022 6:28 AM, Steven Varga wrote:
Hi Tina,
Thank you for sharing. This matches my observations when I checked if
Slurm could do what I am up to: managing AWS EC2 dynamic (spot)
instances. After replacing MySQL with Redis, I now wonder what it
would take to make Slurm node addition/removal dynamic. I've been
looking at the source code for many months now, trying to decide if
it can be done. I am using configless, 3 controllers, and 2 slurmdbds
with a robust Redis Sentinel-based backend.
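For reference, the relevant slurm.conf bits look roughly like this
(hostnames are placeholders):

    SlurmctldHost=ctl1
    SlurmctldHost=ctl2    # backups take over in listed order
    SlurmctldHost=ctl3
    SlurmctldParameters=enable_configless

with the nodes fetching their config at startup via something like
'slurmd --conf-server ctl1:6817'.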
Steven
On Thu., May 5, 2022, 08:57 Tina Friedrich,
<tina.friedr...@it.ox.ac.uk> wrote:
Hi List,
out of curiosity - I would assume that when running configless, one
doesn't need to manually restart slurmd on the nodes when the config
changes?
Hi Steven,
I have no idea what the implications are of doing it every couple of
minutes (although I've certainly managed to restart them every 5
minutes by accident with no real problems caused), but generally,
restarting the daemons (slurmctld, slurmd) is a non-issue, as it's a
safe operation. There's no risk to running jobs or anything. I have
the config management restart them if any files change. It also
doesn't seem to matter if the restarts of the controller & the node
daemons are splayed a bit (i.e. don't happen at the same time), or
what order they happen in.
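If you're not running full config management, even a tiny watcher
would do - rough sketch, untested, path assumed to be the usual one:

    import os
    import subprocess
    import time

    CONF = "/etc/slurm/slurm.conf"

    # Restart slurmd whenever slurm.conf changes (plain mtime polling).
    last = os.stat(CONF).st_mtime
    while True:
        time.sleep(30)
        mtime = os.stat(CONF).st_mtime
        if mtime != last:
            last = mtime
            subprocess.run(["systemctl", "restart", "slurmd"],
                           check=False)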
Tina
On 05/05/2022 13:17, Steven Varga wrote:
> Thank you for the quick reply! I know I am pushing my luck here: is
> it possible to modify slurm: src/common/[read_conf.c, node_conf.c]
> src/slurmctld/[read_config.c, ...] such that the state can be
> maintained dynamically? -- or would it be cheaper to write a job
> manager with fewer features but supporting dynamic nodes from the
> ground up?
> best wishes: steve
>
> On Thu, May 5, 2022 at 12:29 AM Christopher Samuel
> <ch...@csamuel.org> wrote:
>
> On 5/4/22 7:26 pm, Steven Varga wrote:
>
> > I am wondering what is the best way to handle node changes, such
> > as the addition and removal of nodes, in SLURM. The excerpts below
> > suggest a full restart, can someone confirm this?
>
> You are correct, you need to restart the slurmctld and slurmd
> daemons at present. See https://slurm.schedmd.com/faq.html#add_nodes
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/
> : Berkeley, CA, USA
>
--
Tina Friedrich, Advanced Research Computing
Snr HPC Systems Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk