@Tina,

Slurmd reads the config once at startup and runs with it. You would need to have it recheck regularly to see if there are any changes. That is exactly what 'scontrol reconfig' does: it tells all the Slurm nodes to re-read the config.
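
If you want that to happen automatically whenever the config is pushed out, a small watcher on the controller is enough. Something like this untested sketch (the path and interval are just assumptions for illustration):

import os
import subprocess
import time

SLURM_CONF = "/etc/slurm/slurm.conf"   # assumption: adjust for your install
CHECK_INTERVAL = 60                    # seconds; pick whatever suits you

def main():
    last_mtime = os.path.getmtime(SLURM_CONF)
    while True:
        time.sleep(CHECK_INTERVAL)
        mtime = os.path.getmtime(SLURM_CONF)
        if mtime != last_mtime:
            last_mtime = mtime
            # 'scontrol reconfig' asks slurmctld (and through it the
            # slurmd daemons) to re-read the configuration
            subprocess.run(["scontrol", "reconfig"], check=True)

if __name__ == "__main__":
    main()

A config-management hook that simply runs 'scontrol reconfig' after it deploys the file gets you the same result with fewer moving parts.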


@Steven,

It seems to me you could just have a monitor daemon that keeps things up to date. It could watch for the alert that AWS sends (the 2-minute warning, IIRC) and take appropriate action: drain the node and cancel/checkpoint the job. In addition, it could keep an eye on things in case a warning isn't received and a node simply 'vanishes'. I suspect Nagios even has the hooks to make that work. You could also email the user to let them know their job was ended because the spot instance was reclaimed.
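
To make that concrete, here is a rough, untested sketch of such a monitor in Python. The metadata endpoint and the scontrol/squeue/scancel calls are standard; the polling interval, drain reason and everything else are assumptions you would adapt (and with IMDSv2 enforced you would need to fetch a session token first):

import socket
import subprocess
import time
import urllib.error
import urllib.request

IMDS_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"
POLL_SECONDS = 5

def interruption_pending() -> bool:
    # IMDS returns 404 until a spot interruption is actually scheduled.
    try:
        with urllib.request.urlopen(IMDS_URL, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.HTTPError, urllib.error.URLError):
        return False

def drain_and_cancel(node: str) -> None:
    # Take the node out of service so nothing new lands on it...
    subprocess.run(["scontrol", "update", f"NodeName={node}",
                    "State=DRAIN", "Reason=spot reclaim"], check=False)
    # ...then deal with whatever is still running there.
    jobs = subprocess.run(["squeue", "-h", "-o", "%A", "-w", node],
                          capture_output=True, text=True).stdout.split()
    for jobid in jobs:
        # Could checkpoint/requeue and email the user here instead.
        subprocess.run(["scancel", jobid], check=False)

def main():
    node = socket.gethostname().split(".")[0]
    while not interruption_pending():
        time.sleep(POLL_SECONDS)
    drain_and_cancel(node)

if __name__ == "__main__":
    main()

Run one of those per compute node (e.g. as a systemd service) and you have roughly the behaviour described above; Nagios or controller-side monitoring would still be the safety net for nodes that vanish without ever sending the warning.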

Just some ideas,

Brian Andrus

On 5/5/2022 6:28 AM, Steven Varga wrote:
Hi Tina,
Thank you for sharing. This matches my observations from when I checked whether Slurm could do what I am up to: managing AWS EC2 dynamic (spot) instances.

After replacing MySQL with Redis, I now wonder what it would take to make Slurm node addition/removal dynamic. I have been looking at the source code for many months now, trying to decide if it can be done.

I am running configless with 3 controllers and 2 slurmdbd instances, backed by a robust Redis Sentinel setup.
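
For anyone curious, the configless/multi-controller part of that boils down to something like the following (a sketch only; host names are placeholders, and the parameters should be checked against the slurm.conf man page for your version):

# slurm.conf as served from the controllers (configless mode)
SlurmctldHost=ctl1        # primary controller
SlurmctldHost=ctl2        # backup
SlurmctldHost=ctl3        # backup
SlurmctldParameters=enable_configless

# compute nodes then fetch their config at startup, e.g.
#   slurmd --conf-server ctl1:6817
# (or locate the controllers via DNS SRV records)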

Steven


On Thu., May 5, 2022, 08:57 Tina Friedrich, <tina.friedr...@it.ox.ac.uk> wrote:

    Hi List,

    out of curiosity - I would assume that if running configless, one
    doesn't need to manually restart slurmd on the nodes when the config
    changes?

    Hi Steven,

    I have no idea if you want to do it every couple of minutes, or what
    the implications of that are (although I've certainly managed to
    restart them every 5 minutes by accident with no real problems
    caused), but - generally, restarting the daemons (slurmctld, slurmd)
    is a non-issue, as it's a safe operation. There's no risk to running
    jobs or anything. I have the config management restart them if any
    files change. It also doesn't seem to matter if the restarts of the
    controller & the node daemons are splayed a bit (i.e. don't happen
    at the same time), or what order they happen in.

    Tina

    On 05/05/2022 13:17, Steven Varga wrote:
    > Thank you for the quick reply! I know I am pushing my luck here:
    > is it possible to modify slurm: src/common/[read_conf.c,
    > node_conf.c], src/slurmctld/[read_config.c, ...] such that the
    > state can be maintained dynamically? -- or is it cheaper to write
    > a job manager with fewer features but supporting dynamic nodes
    > from the ground up?
    > best wishes: steve
    >
    > On Thu, May 5, 2022 at 12:29 AM Christopher Samuel
    > <ch...@csamuel.org> wrote:
    >
    >     On 5/4/22 7:26 pm, Steven Varga wrote:
    >
    >      > I am wondering what is the best way to handle node changes,
    >      > such as the addition and removal of nodes, in SLURM. The
    >      > excerpts below suggest a full restart, can someone confirm this?
    >
    >     You are correct, you need to restart slurmctld and slurmd
    >     daemons at present. See https://slurm.schedmd.com/faq.html#add_nodes
    >
    >     All the best,
    >     Chris
    >     --
    >     Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
    >

    --
    Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

    Research Computing and Support Services
    IT Services, University of Oxford
    http://www.arc.ox.ac.uk http://www.it.ox.ac.uk
