On Wed, 2017-10-18 at 09:36:42 -0600, Michael Di Domenico wrote:
> is there any way, after a job starts, to determine why the scheduler
> chose the series of nodes it did?
>
> for some reason, on an empty cluster, when i spin up a large job it's
> staggering the allocation across a seemingly random set of nodes
Have you looked into topology? With topology.conf you can group nodes by connecting them (virtually or really; Slurm doesn't check, nor care) to network switches, adding some "locality" to your cluster setup.

> we're using backfill/cons_res + gres, and all the nodes are identical.

Why do you care about the randomness, then?

> in the past it used to select the next node past a down node and then
> start sequentially from there.
>
> i haven't made (or am not aware of) any changes to the system, but
> now it's skipping nodes that presumably should have been in the
> allocation

"Skipping nodes" and "seemingly random" aren't even close, to me. Did you check that the skipped nodes aren't (a) drained or down, or (b) assigned to other jobs?

Cheers,
 S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1
D-14476 Potsdam-Golm
Germany
~~~
Fon: +49-331-567 7274
Fax: +49-331-567 7298
Mail: steffen.grunewald(at)aei.mpg.de
~~~
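P.S. For reference, a minimal sketch of the topology.conf grouping mentioned above. Switch and node names here are made up for illustration, and it assumes `TopologyPlugin=topology/tree` is set in slurm.conf:

```
# slurm.conf must enable the tree topology plugin:
#   TopologyPlugin=topology/tree

# topology.conf: leaf switches with their directly attached nodes
# (switch and node names below are hypothetical)
SwitchName=leaf1 Nodes=node[01-16]
SwitchName=leaf2 Nodes=node[17-32]

# a top-level switch connecting the leaf switches
SwitchName=spine Switches=leaf[1-2]
```

With this in place, the scheduler tries to place a job's nodes under as few switches as possible, which is the "locality" referred to above.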
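P.P.S. The drained/down and busy-node checks can be done with standard Slurm commands; a sketch (output format strings are just one reasonable choice):

```
# list nodes that are drained or down, with state and reason
# (empty output means none)
sinfo -N -t drain,down -o "%N %T %E"

# see which nodes are currently allocated to running jobs
squeue -t RUNNING -o "%i %u %N"
```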