On 2021/07/05 11:39, Kevin Buckley wrote:
Upgraded our Cray TDS from 20.11.7 to 20.11.8, without making any
changes to the configuration, but am now not seeing jobs start to
run, whilst seeing messages in the slurmd log akin to these four:

   Submitted federated JobId=67122494 to tdsname(self)
   _slurm_rpc_submit_batch_job: JobId=67122494 InitPrio=0 usec=8208
   sched: schedule() returning, federation siblings not synced yet
   sched/backfill: _attempt_backfill: returning, federation siblings not synced yet


none of which were in evidence prior to the upgrade.

Didn't see anything in the 20.11.8 changes that suggested anything
to do with "federation" had been introduced, though yet to trawl
through the code.

Anyone seen similar?

Kevin


Starting to look as though something federation-related may have been
"fixed" in 20.11.8, or "unfixed" for combinations of federations of
differing Slurm versions?

Even if I leave the Cray TDS cluster in a federation of one - it had
previously been operating within a federation of two, with a non-Cray
TDS cluster - then jobs start to run within it again.

--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre
