Hi Cyrus,
Thank you for the links. I've taken a good look through the first link (re the cloud cluster) and the only parameter that might be relevant is "assoc_limit_stop", but I'm not sure whether it applies in this instance. The reason given for the delay of the job in question is "priority"; however, quite a few jobs from users in the same accounting group are delayed due to "QOSMaxCpuPerUserLimit". That thread also talks about using the "builtin" scheduler, which I guess would turn off backfill.

I have attached a copy of the current slurm.conf so that you and other members can get a better feel for the whole picture. We certainly see a large number of serial/small (one-node) jobs running through the system, and I'm concerned that my setup encourages this behaviour; how to stem the issue, however, is a mystery to me. If you or anyone else has any relevant thoughts then please let me know. In particular I am keen to understand "assoc_limit_stop" and whether it is a relevant option in this situation.

Best regards,
David

________________________________
From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Cyrus Proctor <cproc...@tacc.utexas.edu>
Sent: 21 March 2019 14:19
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Very large job getting starved out

Hi David,

You might have a look at the thread "Large job starvation on cloud cluster" that started on Feb 27; there are some good tidbits in there. Off the top, without more information, I would venture that the settings you have in slurm.conf end up backfilling the smaller jobs at the expense of scheduling the larger ones. Your partition configs, plus the accounting and scheduler configs from slurm.conf, would be helpful.
Also, search for "job starvation" here: https://slurm.schedmd.com/sched_config.html as another potential starting point.

Best,
Cyrus

On 3/21/19 8:55 AM, David Baker wrote:

Hello,

I understand that this is not a straightforward question; however, I'm wondering if anyone has any useful ideas, please. Our cluster is busy, and the QOS limits users to a maximum of 32 compute nodes on the "batch" queue. Users are making good use of the cluster -- for example, one user is running five 6-node jobs at the moment. On the other hand, a job belonging to another user has been stalled in the queue for around 7 days. He has made reasonable use of the cluster, and as a result his fairshare component is relatively low. Having said that, the priority of his job is high -- it is currently one of the highest-priority jobs in the batch partition queue. From sprio...

  JOBID  PARTITION  PRIORITY     AGE  FAIRSHARE  JOBSIZE  PARTITION  QOS
 359323  batch        180292  100000      79646      547        100    0

I did think that the PriorityDecayHalfLife was quite high at 14 days, so I reduced it to 7 days. For reference, I've included the key scheduling settings from the cluster below. Does anyone have any thoughts, please?

Best regards,
David

PriorityDecayHalfLife   = 7-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = ACCRUE_ALWAYS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 100000
PriorityWeightFairShare = 1000000
PriorityWeightJobSize   = 10000000
PriorityWeightPartition = 1000
PriorityWeightQOS       = 10000
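[Editorial note: a small sketch of how the multifactor plugin combines the sprio components quoted above. Slurm computes each weighted component as weight * factor (factor in 0.0-1.0) and sums them; the numbers below are taken from the sprio output and slurm.conf excerpt in this thread, and the sum matches the reported priority to within sprio's per-component rounding.]

```python
# Weights from the slurm.conf excerpt in this thread.
weights = {
    "age": 100_000,
    "fairshare": 1_000_000,
    "jobsize": 10_000_000,
    "partition": 1_000,
    "qos": 10_000,
}
# Weighted components reported by sprio for job 359323.
components = {
    "age": 100_000,
    "fairshare": 79_646,
    "jobsize": 547,
    "partition": 100,
    "qos": 0,
}

# Recover the normalised factors: component = weight * factor.
factors = {k: components[k] / weights[k] for k in weights}

# The job priority is (approximately) the sum of the components;
# sprio rounds each component, so the total can be off by a point.
priority = sum(components.values())

print(factors["age"])      # 1.0 -- the job has hit PriorityMaxAge
print(factors["jobsize"])  # ~5.5e-05 -- the 32-node job gets almost no size boost
print(priority)            # 180293, vs sprio's reported 180292
```

Note the job-size factor: despite PriorityWeightJobSize being the largest weight (10,000,000), the SMALL_RELATIVE_TO_TIME flag leaves this job with a contribution of only 547, which may be part of why the large job accrues so little advantage.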
[Attachment: slurm.conf]
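[Editorial note: on David's question about "assoc_limit_stop" -- it is an option within SchedulerParameters in slurm.conf. The fragment below is a minimal illustrative sketch, not an excerpt from the attached file; the bf_* values are placeholders.]

```
# Hypothetical slurm.conf excerpt -- not from the attached file.
#
# assoc_limit_stop: if a job cannot start because of an association or
# QOS limit (e.g. QOSMaxCpuPerUserLimit), stop attempting to start any
# lower-priority jobs in that partition, instead of backfilling around
# the blocked job. This trades some throughput for less starvation of
# large, limit-blocked jobs.
SchedulerType=sched/backfill
SchedulerParameters=assoc_limit_stop,bf_continue,bf_window=10080
```

Whether this helps here depends on whether the starved job is actually blocked by a limit; if its only pending reason is "priority", the backfill tuning parameters discussed in sched_config.html may matter more.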