Hi Cyrus,

Thank you for the links. I've taken a good look through the first link (re the 
cloud cluster thread) and the only parameter that looks potentially relevant is 
"assoc_limit_stop", though I'm not sure whether it applies in this instance. The 
reason given for the delay of the job in question is "priority"; however, quite 
a lot of jobs from users in the same accounting group are delayed due to 
"QOSMaxCpuPerUserLimit". That thread also mentions using the "builtin" 
scheduler, which I gather would turn off backfill altogether.
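
For what it's worth, this is how I read the option from sched_config.html -- a 
sketch only, not taken from our live configuration (the surrounding values are 
illustrative):

```
# In slurm.conf: with assoc_limit_stop set, when a job is held back by an
# association/QOS limit, lower-priority jobs in that partition are not
# started in its place (illustrative values).
SchedulerType=sched/backfill
SchedulerParameters=assoc_limit_stop,bf_continue

# The alternative the thread mentions, which disables backfill entirely:
# SchedulerType=sched/builtin
```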


I have attached a copy of the current slurm.conf so that you and other members 
can get a better feel for the whole picture. We certainly see a large number of 
serial/small (one-node) jobs running through the system, and I'm concerned that 
my setup encourages this behaviour; however, how to stem the issue is a mystery 
to me.
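
One idea, which I have not tried, would be to rein in how aggressively backfill 
favours the small jobs -- something along these lines (parameter names from 
sched_config.html, values purely illustrative):

```
# Illustrative sketch only: cap how many jobs backfill will start per
# user per cycle, and bound the planning window it considers (minutes).
SchedulerParameters=bf_max_job_user=10,bf_window=1440,bf_continue
```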


If you or anyone else has any relevant thoughts then please let me know. In 
particular, I am keen to understand "assoc_limit_stop" and whether it is a 
relevant option in this situation.


Best regards,

David

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Cyrus 
Proctor <cproc...@tacc.utexas.edu>
Sent: 21 March 2019 14:19
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Very large job getting starved out


Hi David,


You might have a look at the thread "Large job starvation on cloud cluster" 
that started on Feb 27; there are some good tidbits in there. Off the top of my 
head, without more information, I would venture that the settings you have in 
slurm.conf end up backfilling the smaller jobs at the expense of scheduling the 
larger ones.


Your partition configs plus accounting and scheduler configs from slurm.conf 
would be helpful.


Also, search for "job starvation" here: 
https://slurm.schedmd.com/sched_config.html as another potential starting point.


Best,

Cyrus


On 3/21/19 8:55 AM, David Baker wrote:

Hello,


I understand that this is not a straightforward question, however I'm wondering 
if anyone has any useful ideas, please. Our cluster is busy and the QOS limits 
users to a maximum of 32 compute nodes on the "batch" queue. Users are making 
good use of the cluster -- for example, one user is running five 6-node jobs at 
the moment. On the other hand, a job belonging to another user has been stalled 
in the queue for around 7 days. He has made reasonable use of the cluster and 
as a result his fairshare component is relatively low. Having said that, the 
priority of his job is high -- it is currently one of the highest priority jobs 
in the batch partition queue. From sprio...


  JOBID  PARTITION  PRIORITY     AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS
 359323  batch        180292  100000      79646      547        100    0
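
As a quick sanity check on those numbers: my understanding is that the PRIORITY 
column is simply the sum of the weighted factor columns (each rounded 
separately), which a couple of lines of Python confirms:

```python
# sprio factor contributions for job 359323, as reported above.
components = {
    "age": 100000,
    "fairshare": 79646,
    "jobsize": 547,
    "partition": 100,
    "qos": 0,
}

total = sum(components.values())
print(total)  # 180293 -- sprio shows 180292; each factor is rounded before summing
```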


I did think that the PriorityDecayHalfLife was quite high at 14 days and so I 
reduced that to 7 days. For reference I've included the key scheduling settings 
from the cluster below. Does anyone have any thoughts, please?


Best regards,

David


PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 100000
PriorityWeightFairShare = 1000000
PriorityWeightJobSize   = 10000000
PriorityWeightPartition = 1000
PriorityWeightQOS       = 10000
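
As an aside on the half-life change: my rough mental model (a sketch, not 
Slurm's exact incremental calculation, which is applied every 
PriorityCalcPeriod) is that usage recorded t days ago is discounted by 
0.5 ** (t / half_life):

```python
# Sketch of how PriorityDecayHalfLife discounts historical usage:
# usage recorded age_days ago contributes usage * 0.5 ** (age_days / half_life).
def decayed_usage(usage: float, age_days: float, half_life_days: float = 7.0) -> float:
    return usage * 0.5 ** (age_days / half_life_days)

print(decayed_usage(1000.0, 7.0))   # 500.0 -- exactly one half-life old
print(decayed_usage(1000.0, 14.0))  # 250.0 -- two half-lives old
```

With the old 14-day setting, week-old usage would still have contributed about 
71% (0.5 ** 0.5); at 7 days it contributes 50%, so heavy users' fairshare 
penalties fade faster.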



Attachment: slurm.conf
Description: slurm.conf
