Hi Loris,

Thank you for sharing your multifactor priority configuration with us. I understand what you mean about the QOS factor -- I've reduced it and increased the FS factor to see where that takes us. Our QOS factor is only there to ensure that test jobs gain a higher priority more quickly than other jobs; on reflection, though, our high QOS factor setting was well over the top.
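For reference, the adjusted weights now look roughly like this -- the exact values are still experimental, so treat them as illustrative rather than final:

  PriorityWeightFairShare = 1000000   # increased: fair-share should now dominate
  PriorityWeightQOS       = 100000    # reduced: still a boost for short test-job QOSs
  PriorityWeightAge       = 100000    # unchanged
  PriorityWeightJobSize   = 100000    # unchanged
  PriorityWeightPartition = 0         # unchanged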
We turned on a higher debugging level this morning to help us understand the situation better. There is a definite split between small and larger jobs. Given that our user group is growing and we only have 750 compute nodes, perhaps we should expect to see a lot of backfill while resources become available for larger jobs.

Best regards,
David

________________________________
From: Loris Bennett <loris.benn...@fu-berlin.de>
Sent: 20 November 2018 15:58
To: Baker D.J.
Cc: Slurm User Community List
Subject: Re: [slurm-users] Excessive use of backfill on a cluster

Hi David,

We have

PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=10000000
PriorityWeightAge=10000
PriorityWeightPartition=10000
PriorityWeightJobSize=0
PriorityWeightQOS=10000
PriorityMaxAge=7-0
PriorityCalcPeriod=5

SchedulerType=sched/backfill
SchedulerParameters=max_job_bf=50,bf_interval=60,bf_window=20160,default_queue_depth=1000

In particular, our main priority factor, by a long way, is Fairshare, with a slight advantage for old jobs and for QOSs with a short run-time. With your priority weights, QOS is the most important by a factor of 10. I'm not quite sure what effect this will have, other than that your priorities will be a bit more static, since the total priority will have a reduced time-dependency.

We had to add the SchedulerParameters settings to get backfill working properly at all, but that obviously isn't your problem.
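In case it helps with the tuning: the multifactor plugin just computes a weighted sum, with each individual factor normalised to the range 0.0-1.0 (this is straight from the priority_multifactor documentation; I've left out the TRES term, which only applies if PriorityWeightTRES is set):

  Job_priority = PriorityWeightAge       * age_factor
               + PriorityWeightFairshare * fairshare_factor
               + PriorityWeightJobSize   * job_size_factor
               + PriorityWeightPartition * partition_factor
               + PriorityWeightQOS       * qos_factor

So with your weights, a job whose QOS factor is 1.0 can pick up 1,000,000 points from QOS alone, while each of the other factors can contribute at most 100,000. You can see the per-job breakdown for pending jobs with sprio ('sprio -l' shows the individual components, 'sprio -w' the configured weights).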
Cheers,

Loris

Baker D.J. <d.j.ba...@soton.ac.uk> writes:

> Hello,
>
> Thank you for your reply and for the explanation. That makes sense --
> your explanation of backfill is as we expected. I think it's more that
> we are surprised that almost all our jobs were being scheduled using
> backfill. We very rarely see any being scheduled normally. It could be
> that we haven't actually tuned our priority weights particularly well.
> We potentially need a setup that will allow users to run everything
> from small jobs (including very small, short-duration test jobs with a
> high QOS) to large jobs running over a range of times, without too
> many users losing out. Initially, we had our Age and Job size scaling
> factors too low, but we currently have the setup shown below.
>
> Any thoughts, please?
>
> Best regards,
>
> David
>
> PriorityParameters = (null)
> PriorityDecayHalfLife = 14-00:00:00
> PriorityCalcPeriod = 00:05:00
> PriorityFavorSmall = No
> PriorityFlags = SMALL_RELATIVE_TO_TIME,DEPTH_OBLIVIOUS
> PriorityMaxAge = 14-00:00:00
> PriorityUsageResetPeriod = NONE
> PriorityType = priority/multifactor
> PriorityWeightAge = 100000
> PriorityWeightFairShare = 100000
> PriorityWeightJobSize = 100000
> PriorityWeightPartition = 0
> PriorityWeightQOS = 1000000
> PriorityWeightTRES = (null)
> PropagatePrioProcess = 0
>
> ________________________________
> From: Loris Bennett <loris.benn...@fu-berlin.de>
> Sent: 20 November 2018 13:26:14
> To: Baker D.J.
> Cc: Slurm User Community List
> Subject: Re: [slurm-users] Excessive use of backfill on a cluster
>
> Hi David,
>
> Baker D.J. <d.j.ba...@soton.ac.uk> writes:
>
>> Hello,
>>
>> We are running Slurm 18.08.0 on our cluster and I am concerned that
>> Slurm appears to be using backfill scheduling excessively. In fact,
>> the vast majority of jobs are being scheduled using backfill. So, for
>> example, I have just submitted a set of three serial jobs. They all
>> started on a compute node that was completely free, but,
>> disconcertingly, in the slurmctld log they were all reported as
>> started using backfill, and that isn't making sense...
>>
>> [2018-11-20T12:31:27.598] backfill: Started JobId=217031 in batch on red158
>> [2018-11-20T12:32:28.004] backfill: Started JobId=217032 in batch on red158
>> [2018-11-20T12:33:58.608] backfill: Started JobId=217033 in batch on red158
>>
>> Either I don't understand the context of backfill in Slurm, or the
>> above is odd. Has anyone seen this "overuse" (unnecessary use) of
>> backfill on their cluster, and/or could anyone offer advice, please?
>
> I am not sure what "excessive backfilling" might mean. If you have a
> job which requires a large amount of resources to become available
> before it can start, then backfilling will allow other jobs with a
> lower priority to be run, if this can be achieved without delaying the
> start of the large job. So if a job needs 100 nodes, at some point 99
> of them will be idle. Jobs which can start and finish before the 100th
> node becomes available will indeed be backfilled onto the empty nodes.
> This is how backfilling is supposed to work.
>
> Or am I misunderstanding your problem?
>
> Cheers,
>
> Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin          Email  loris.benn...@fu-berlin.de