@Tina,
Figure slurmd reads the config once at startup and runs with it. You
would need to have it recheck regularly to see if there are any
changes. This is exactly what 'scontrol reconfig' does: it tells all
the Slurm daemons to reread the config.
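For example, after pushing out a new slurm.conf, from any host with
scontrol access:

    scontrol reconfig

and slurmctld plus all the slurmd daemons will reread their
configuration without a restart (though a handful of settings still
require a full daemon restart).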
@Steven,
It seems to me you could just have a monitor daemon that keeps things
up-to-date.
It could watch for the alert that AWS sends (the 2-minute warning,
IIRC) and take appropriate action: drain the node and
cancel/checkpoint the job.
In addition, it could keep an eye on things in the event a warning
wasn't received and a node 'vanishes'. I suspect Nagios even has the
hooks to make that work. You could also email the user to let them
know their job was ended because the spot instance was pulled.
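As a rough sketch of the spot-warning part (untested; assumes the node
can query instance metadata IMDSv1-style, that the hostname matches
the Slurm node name, and that scontrol/scancel are in PATH):

    import socket
    import subprocess
    import time

    import requests

    # Returns 404 until AWS schedules an interruption, then 200 with a
    # small JSON body. IMDSv2 would need a token request first.
    IMDS = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    node = socket.gethostname()
    while True:
        try:
            r = requests.get(IMDS, timeout=2)
        except requests.RequestException:
            time.sleep(5)
            continue
        if r.status_code == 200:
            # Drain so nothing new lands here, then cancel whatever is
            # still running; emailing the user could hook in here too.
            subprocess.run(["scontrol", "update", f"nodename={node}",
                            "state=drain", "reason=spot-reclaim"],
                           check=False)
            subprocess.run(["scancel", f"--nodelist={node}"],
                           check=False)
            break
        time.sleep(5)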
Just some ideas,
Brian Andrus
On 5/5/2022 6:28 AM, Steven Varga wrote:
Hi Tina,
Thank you for sharing. This matches my observations when I checked if
Slurm could do what I am up to: managing AWS EC2 dynamic (spot)
instances. After replacing MySQL with Redis, I now wonder what it
would take to make Slurm node addition/removal dynamic. I've been
looking at the source code for many months now, trying to decide if
it can be done. I am using configless, 3 controllers, and 2 slurmdbds
with a robust Redis Sentinel-based backend.
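For reference, the relevant slurm.conf bits look roughly like this
(hostnames are placeholders):

    SlurmctldHost=ctl1
    SlurmctldHost=ctl2    # backups take over in listed order
    SlurmctldHost=ctl3
    SlurmctldParameters=enable_configless

with the nodes fetching their config at startup via something like
'slurmd --conf-server ctl1:6817'.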
Steven
On Thu., May 5, 2022, 08:57 Tina Friedrich,
<tina.friedr...@it.ox.ac.uk> wrote:
Hi List,
out of curiosity - I would assume that when running configless, one
doesn't need to manually restart slurmd on the nodes when the config
changes?
Hi Steven,
I have no idea what the implications are of doing it every couple of
minutes (although I've certainly managed to restart them every 5
minutes by accident with no real problems caused), but generally,
restarting the daemons (slurmctld, slurmd) is a non-issue, as it's a
safe operation. There's no risk to running jobs or anything. I have
the config management restart them if any files change. It also
doesn't seem to matter if the restarts of the controller & the node
daemons are splayed a bit (i.e. don't happen at the same time), or
what order they happen in.
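If you're not running full config management, even a tiny watcher
would do - rough sketch, untested, path assumed to be the usual one:

    import os
    import subprocess
    import time

    CONF = "/etc/slurm/slurm.conf"

    # Restart slurmd whenever slurm.conf changes (plain mtime polling).
    last = os.stat(CONF).st_mtime
    while True:
        time.sleep(30)
        mtime = os.stat(CONF).st_mtime
        if mtime != last:
            last = mtime
            subprocess.run(["systemctl", "restart", "slurmd"],
                           check=False)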
Tina
On 05/05/2022 13:17, Steven Varga wrote:
> Thank you for the quick reply! I know I am pushing my luck here: is
> it possible to modify slurm: src/common/[read_conf.c, node_conf.c]
> src/slurmctld/[read_config.c, ...] such that the state can be
> maintained dynamically? -- or would it be cheaper to write a job
> manager with fewer features but supporting dynamic nodes from the
> ground up?
> best wishes: steve
>
> On Thu, May 5, 2022 at 12:29 AM Christopher Samuel
> <ch...@csamuel.org> wrote:
>
> On 5/4/22 7:26 pm, Steven Varga wrote:
>
> > I am wondering what is the best way to handle node changes, such
> > as the addition and removal of nodes, in SLURM. The excerpts below
> > suggest a full restart, can someone confirm this?
>
> You are correct, you need to restart the slurmctld and slurmd
> daemons at present. See https://slurm.schedmd.com/faq.html#add_nodes
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/
> : Berkeley, CA, USA
>
--
Tina Friedrich, Advanced Research Computing
Snr HPC Systems Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk