On Wed, 2017-10-18 at 09:36:42 -0600, Michael Di Domenico wrote:
> 
> is there any way, after a job starts, to determine why the scheduler
> chose the series of nodes it did?
> 
> for some reason on an empty cluster when I spin up a large job it's
> staggering the allocation across a seemingly random selection of
> nodes

Have you looked into topology? With topology.conf, you can group nodes
by connecting them (virtually or really, Slurm doesn't check or care)
to network switches, adding some "locality" to your cluster setup.
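
A minimal sketch of what that might look like, assuming the tree
topology plugin is enabled (TopologyPlugin=topology/tree in slurm.conf);
the switch and node names here are made-up placeholders:

```
# topology.conf (hypothetical layout, names are examples only)
SwitchName=leaf1 Nodes=node[01-16]
SwitchName=leaf2 Nodes=node[17-32]
SwitchName=spine Switches=leaf[1-2]
```

With something like this in place, the scheduler will try to keep an
allocation within as few switches as possible rather than scattering it.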

> we're using backfill/cons_res + gres, and all the nodes are identical.

Why do you care about the randomness then?

> in the past it used to select the next node past a down node and then
> start sequential from there.
> 
> I haven't made (and am not aware of) any changes to the system, but
> now it's skipping nodes that presumably should have been in the
> allocation

"skipping nodes" and "seemingly random" aren't even close to me.
Did you check that the skipped nodes aren't (a) drained or down, or (b)
assigned to other jobs?
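
A quick way to rule those out, using standard Slurm commands (the node
name below is just a placeholder):

```
# List down/drained nodes together with the admin-set reason
sinfo -R

# Show the state of every node, one per line
sinfo -N -o "%N %t"

# Show jobs currently running on a specific node
squeue -w node17
```

If the "skipped" nodes show up as drained, down, or already occupied,
the allocation pattern is explained.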

Cheers,
 S

-- 
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
D-14476 Potsdam-Golm
Germany
~~~
Fon: +49-331-567 7274
Fax: +49-331-567 7298
Mail: steffen.grunewald(at)aei.mpg.de
~~~
