Hello,

I've finally got job throughput/turnaround to a reasonable level on our 
cluster. Most of the time the job activity on the cluster sets the default QOS 
to 32 nodes (there are 464 nodes in the default queue). Jobs requesting a node 
count close to the QOS limit (for example 22 nodes) are scheduled within 24 
hours, which is better than it has been, though I suspect there is still room 
for improvement. I note that these large jobs still struggle to be given a 
starttime; however, many jobs are now being given a starttime following my 
SchedulerParameters makeover.

I used advice from the mailing list and the Slurm high throughput document to 
help me make changes to the scheduling parameters. They are now...

SchedulerParameters=assoc_limit_continue,batch_sched_delay=20,bf_continue,bf_interval=300,bf_min_age_reserve=10800,bf_window=3600,bf_resolution=600,bf_yield_interval=1000000,partition_job_depth=500,sched_max_job_start=200,sched_min_interval=2000000

Also...
PriorityFavorSmall=NO
PriorityFlags=SMALL_RELATIVE_TO_TIME,ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityMaxAge=1-0
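
For what it's worth, my understanding (from the multifactor priority
documentation) of why PriorityMaxAge matters so much is that the age factor
saturates once a job has waited that long: age_factor = min(queue_wait /
PriorityMaxAge, 1.0), which is then multiplied by PriorityWeightAge. A minimal
sketch of that behaviour (not Slurm source code; the weight value here is made
up for illustration):

```python
# Sketch of the age component of priority/multifactor, assuming
# age_factor = min(wait / PriorityMaxAge, 1.0) * PriorityWeightAge.
# PRIORITY_WEIGHT_AGE is a hypothetical site value, not ours.

PRIORITY_MAX_AGE = 1 * 24 * 3600   # "1-0", i.e. 1 day, in seconds
PRIORITY_WEIGHT_AGE = 1000         # hypothetical PriorityWeightAge

def age_priority(queued_seconds: int) -> int:
    """Age contribution to a job's priority. It caps once the job has
    waited PriorityMaxAge, so shortening PriorityMaxAge makes waiting
    jobs reach their full age credit sooner."""
    factor = min(queued_seconds / PRIORITY_MAX_AGE, 1.0)
    return int(PRIORITY_WEIGHT_AGE * factor)

# A job that has waited 12 hours gets half the age weight;
# anything that has waited past 1 day gets the full weight.
print(age_priority(12 * 3600))      # 500
print(age_priority(3 * 24 * 3600))  # 1000
```

So with 1-0 the big jobs reach maximum age priority after a single day of
waiting instead of seven, which would explain why they stopped languishing.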

The most significant change was actually reducing "PriorityMaxAge" from 7-0 
to 1-0. Before that change the larger jobs could hang around in the queue for 
days. Does it therefore make sense to reduce PriorityMaxAge further, to less 
than 1 day? Your advice would be appreciated, please.

Best regards,
David
