[slurm-dev] Re: Running jobs are stopped and requeued when adding new nodes

Bjørn-Helge Mevik Mon, 23 Oct 2017 04:33:15 -0700

Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> writes:

> I have added nodes to an existing partition several times using the same
> procedure which you describe, and no bad side effects have been noticed. This
> is a very normal kind of operation in a cluster, where hardware may be added
> or retired from time to time, while the cluster of course continues its normal
> production.  We must be able to do this, especially when transferring existing
> nodes into a new Slurm cluster.


I too have done the same many times, and have never seen any problem
like this.

> Douglas Jacobsen explained very well why problems may arise.  It seems to me
> that this completely rigid nodelist bit mask in the network is a Slurm design
> problem, and that it ought to be fixed.

The bitmask design is for speed.  Given how hard it is to get the
backfiller fast enough under certain loads (lots of small, distributed
jobs running, and a long queue of pending jobs), I personally wouldn't
want SchedMD to sacrifice that speed to make updates of node lists
easier.  Especially since I haven't seen the problem JinSung Kang
reports. :)
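
To illustrate the trade-off, here is a rough sketch in Python.  It is not
Slurm's actual code; the helper names (build_index, to_bitmask,
from_bitmask) and the node names are made up for the example, and it
simply assumes that a node's bit position is its position in the sorted
node list.  It shows why set operations on bitmasks are cheap, and why a
mask built against an old node list decodes wrongly once a node is
inserted.

```python
# Illustrative sketch only; hypothetical helpers, not Slurm's data structures.

def build_index(node_names):
    """Map each node name to a bit position (its rank in the sorted list)."""
    return {name: i for i, name in enumerate(sorted(node_names))}

def to_bitmask(names, index):
    """Encode a collection of node names as an integer bitmask."""
    mask = 0
    for name in names:
        mask |= 1 << index[name]
    return mask

def from_bitmask(mask, index):
    """Decode a bitmask back into node names using the given index."""
    return sorted(name for name, i in index.items() if (mask >> i) & 1)

# Node list before the change: n01 -> bit 0, n02 -> bit 1, n04 -> bit 2.
old_index = build_index(["n01", "n02", "n04"])
job_mask = to_bitmask(["n02", "n04"], old_index)
idle_mask = to_bitmask(["n01"], old_index)

# The fast part: one AND answers "does this job touch any idle node?"
overlaps = bool(job_mask & idle_mask)

# Adding n03 moves n04 from bit 2 to bit 3, so the old mask is now stale.
new_index = build_index(["n01", "n02", "n03", "n04"])
stale_decode = from_bitmask(job_mask, new_index)
fresh_decode = from_bitmask(job_mask, old_index)
```

With the old index the job's mask decodes to n02 and n04; with the new
index the same bits decode to n02 and n03.  So everything that holds
such a mask has to agree on the same node list.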

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
