On 10/25/2017 01:52 PM, Holger Naundorf wrote:
>> I'd really appreciate any help the SLURM wizards can provide! We suspect
>> it's something to do with how we've set up QoS, or maybe we need to
>> tweak the scheduler configuration in 17.02.8; however, there's no single
>> clear path forward. Just let me know if there's any further information
>> I can provide to help troubleshoot or give fodder for suggestions.
>
> While I am in no way a SLURM wizard, one thing I would try is
> increasing 'bf_max_job_test' to something much bigger (on the order of
> the usual number of jobs in your queue). With the current setting (as far
> as I understand it), as soon as your 50 top-priority queued jobs are
> waiting for 'legitimate' reasons (i.e. their designated nodes/QoS are
> full), nothing below them will get backfilled anymore.

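
For reference, a rough sketch of what that change might look like in
slurm.conf. The value 1000 is only an illustrative assumption, not a
recommendation; something close to your typical queue depth is the idea.
SchedulerParameters is a single comma-separated option, so anything
already set there has to stay on the same line.

```
# slurm.conf (fragment)
# Assumption: the backfill scheduler is in use and the queue usually holds
# on the order of 1000 pending jobs. 1000 is an illustrative value only.
SchedulerType=sched/backfill
SchedulerParameters=bf_max_job_test=1000
```

After editing the file, 'scontrol reconfigure' should make slurmctld
re-read it, and 'scontrol show config | grep SchedulerParameters' shows
the value actually in effect.
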
I agree that the backfill scheduler requires configuration beyond the
default settings! This surprised me as well. I wrote some notes in my
Wiki which could be used as a starting point:
https://wiki.fysik.dtu.dk/niflheim/Slurm_scheduler#backfill-scheduler
/Ole