On 2021/07/05 11:39, Kevin Buckley wrote:
Upgraded our Cray TDS from 20.11.7 to 20.11.8, without making any
changes to the configuration, but am now not seeing jobs start to
run, whilst seeing messages in the slurmd log akin to these four:
Submitted federated JobId=67122494 to tdsname(self)
_slurm_rpc_submit_batch_job: JobId=67122494 InitPrio=0 usec=8208
sched: schedule() returning, federation siblings not synced yet
sched/backfill: _attempt_backfill: returning, federation siblings not synced yet
none of which were in evidence prior to the upgrade.
Didn't see anything in the 20.11.8 changes that suggested anything
to do with "federation" had been introduced, though yet to trawl
through the code.
Anyone seen similar?
Kevin
Starting to look as though something federation-related may have been
"fixed" in 20.11.8, or "unfixed" for combinations of federations of
differing Slurm versions?
Even if I leave the Cray TDS cluster in a federation of one - it had
previously been operating within a federation of two, with a non-Cray
TDS cluster - then jobs start to run within it again.
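
For anyone wanting to check whether they are hitting the same thing, here is a rough Python sketch that counts the "federation siblings not synced yet" messages per reporting component. The helper name `count_unsynced` and the sample text are hypothetical, the sample lines are just the ones quoted above, and the leading "[timestamp]" stripping is an assumption about how your log lines are prefixed.

```python
import re
from collections import Counter

# Message fragment taken from the log lines quoted above.
NOT_SYNCED = "federation siblings not synced yet"

# Assumption: a log line may start with a bracketed timestamp, e.g. "[...] ".
TIMESTAMP_PREFIX = re.compile(r"^\[[^\]]*\]\s*")


def count_unsynced(log_text):
    """Count 'not synced' messages per component (hypothetical helper)."""
    counts = Counter()
    for line in log_text.splitlines():
        if NOT_SYNCED not in line:
            continue
        # Drop any leading "[timestamp] " before looking for the component.
        body = TIMESTAMP_PREFIX.sub("", line.strip())
        # The component is whatever precedes the first ": ", e.g. "sched".
        component = body.split(": ", 1)[0]
        counts[component] += 1
    return counts


# Sample input: the four messages quoted earlier in this thread.
sample = """\
Submitted federated JobId=67122494 to tdsname(self)
_slurm_rpc_submit_batch_job: JobId=67122494 InitPrio=0 usec=8208
sched: schedule() returning, federation siblings not synced yet
sched/backfill: _attempt_backfill: returning, federation siblings not synced yet
"""

if __name__ == "__main__":
    for component, n in count_unsynced(sample).items():
        print(f"{component}: {n}")
```

To run it against a real log, read the file into a string and pass that to `count_unsynced` in place of `sample`.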
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre