Greetings, fellow general university resource administrator.

A couple of things come to mind from my experience:

1) Does your serial partition share nodes with the other, non-serial partitions?

2) What is the maximum job time allowed for the serial partition (if the answer 
to the previous question was “yes”) and for the non-serial partitions? Are your 
users submitting noticeably longer jobs than they used to?

3) Are you using the backfill scheduler at all?

--
Mike Renfro, PhD  / HPC Systems Administrator, Information Technology Services
931 372-3601      / Tennessee Tech University

On Jan 31, 2020, at 6:23 AM, David Baker <d.j.ba...@soton.ac.uk> wrote:

Hello,

Our SLURM cluster is relatively small. We have 350 standard compute nodes each 
with 40 cores. The largest job that users can run on the partition is one 
requesting 32 nodes. Our cluster is a general university research resource and 
so there are many different sizes of jobs ranging from single core jobs, that 
get routed to a serial partition via the job_submit.lua, through to jobs 
requesting 32 nodes. When we first started the service, 32 node jobs were 
typically taking in the region of 2 days to schedule -- recently queuing times 
have started to get out of hand. Our setup is essentially...

PriorityFavorSmall=NO
FairShareDampeningFactor=5
PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0

PriorityWeightAge=400000
PriorityWeightPartition=1000
PriorityWeightJobSize=500000
PriorityWeightQOS=1000000
PriorityMaxAge=7-0
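
As a rough illustration of how these weights combine: the multifactor plugin 
computes a job's priority as a weighted sum of factors that each lie between 
0.0 and 1.0. The Python sketch below uses the weights listed above. The linear 
job-size approximation (nodes requested divided by total nodes) is an 
assumption, since Slurm's actual calculation also takes CPUs into account, and 
the fair-share term is left out because its weight is not shown here. The 
function names are hypothetical.

```python
# Weights copied from the configuration listed above.
WEIGHTS = {
    "age": 400_000,        # PriorityWeightAge
    "partition": 1_000,    # PriorityWeightPartition
    "job_size": 500_000,   # PriorityWeightJobSize
    "qos": 1_000_000,      # PriorityWeightQOS
}

TOTAL_NODES = 350      # standard compute nodes in the cluster
MAX_AGE_DAYS = 7.0     # PriorityMaxAge=7-0


def age_factor(days_queued: float) -> float:
    """Age factor: grows with queue time and saturates at 1.0 after PriorityMaxAge."""
    return min(days_queued / MAX_AGE_DAYS, 1.0)


def job_size_factor(nodes: int) -> float:
    """ASSUMPTION: with PriorityFavorSmall=NO, approximate the factor as the
    fraction of the cluster's nodes requested. Slurm also considers CPUs."""
    return min(nodes / TOTAL_NODES, 1.0)


def priority_terms(nodes: int, days_queued: float,
                   partition_factor: float = 0.0,
                   qos_factor: float = 0.0) -> dict:
    """Return each weighted term separately (fair-share term omitted)."""
    return {
        "age": WEIGHTS["age"] * age_factor(days_queued),
        "job_size": WEIGHTS["job_size"] * job_size_factor(nodes),
        "partition": WEIGHTS["partition"] * partition_factor,
        "qos": WEIGHTS["qos"] * qos_factor,
    }


if __name__ == "__main__":
    # Compare a single-node job with the largest allowed job after 2 days queued.
    for nodes in (1, 32):
        terms = priority_terms(nodes, days_queued=2.0)
        print(nodes, {name: round(value) for name, value in terms.items()})
```

Under these assumptions a 32-node job gets a job-size term of roughly 45,700 
out of a possible 500,000. That is about 44,000 more than a single-node job, 
which is less than what one extra day of queue age contributes (about 57,000).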

To try to reduce the queuing times for our bigger jobs, should we increase the 
PriorityWeightJobSize factor in the first instance to bump up the priority of 
such jobs? Or should we define a set of QOSs that we assign to jobs in our 
job_submit.lua depending on the size of the job? In other words, could there be 
a "large" QOS that gives the largest jobs a higher priority and also limits how 
many of those jobs a single user can submit?

Your advice would be appreciated, please. At the moment these large jobs are 
not accruing a sufficiently high priority to rise above the other jobs in the 
cluster.

Best regards,
David
