Hello David,

David Baker <d.j.ba...@soton.ac.uk> writes:
> Hello,
>
> I've taken a very good look at our cluster, however as yet I have not
> made any significant changes. The one change that I did make was to
> increase the "jobsizeweight". That's now our dominant parameter, and it
> does ensure that our largest jobs (> 20 nodes) are making it to the top
> of the sprio listing, which is what we want to see.
>
> These large jobs aren't making any progress despite the priority lift.
> I additionally decreased the nice value of the job that sparked this
> discussion. That is (looking at sprio), there is a 32 node job with a
> very high priority...
>
>   JOBID   PARTITION  USER     PRIORITY  AGE     FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
>   280919  batch      mep1c10  1275481   400000  59827      415655   0          0    -400000
>
> That job has been sitting in the queue for well over a week, and it is
> disconcerting that we never see nodes becoming idle in order to service
> these large jobs. Nodes do become idle and then get scooped up by jobs
> started by backfill. Looking at the slurmctld logs, I see that the vast
> majority of jobs are being started via backfill -- including, for
> example, a 24 node job. I see very few jobs allocated by the main
> scheduler. That is, messages like "sched: Allocate JobId=296915" are
> few and far between, and I never see any of the large jobs being
> allocated in the batch queue.
>
> Surely this is not correct. Does anyone have any advice on what to
> check, please?

Have you looked at what 'sprio' says? I usually want to see the list
sorted by priority, and so call it like this:

  sprio -l -S "%Y"

If you run

  scontrol show job <jobid>

is the entry 'NodeList' ever anything other than '(null)'?
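Putting those two checks together with your observation about the
slurmctld log, a quick diagnostic pass might look like this (the log
path is site-specific, and the exact log message wording varies between
Slurm versions, so treat the grep patterns as a sketch):

```
# Pending jobs in long format, sorted by priority
sprio -l -S "%Y"

# NodeList stays "(null)" until the job actually receives an allocation
scontrol show job 280919 | grep -E 'JobState|NodeList|Reason'

# Rough count of main-scheduler starts versus backfill starts
# (adjust the log path for your site)
grep -c 'sched: Allocate'  /var/log/slurm/slurmctld.log
grep -c 'backfill: Start'  /var/log/slurm/slurmctld.log
```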
Cheers,

Loris

> Best regards,
> David
>
> ________________________________
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Killian Murphy <killian.mur...@york.ac.uk>
> Sent: 04 February 2020 10:48
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] Longer queuing times for larger jobs
>
> Hi David.
>
> I'd love to hear back about the changes that you make and how they
> affect the performance of your scheduler.
>
> Any chance you could let us know how things go?
>
> Killian
>
> On Tue, 4 Feb 2020 at 10:43, David Baker <d.j.ba...@soton.ac.uk> wrote:
>
> Hello,
>
> Thank you very much again for your comments and the details of your
> slurm configuration. All the information is really useful. We are
> working on our cluster right now and making some appropriate changes.
> We'll see how we get on over the next 24 hours or so.
>
> Best regards,
> David
>
> ________________________________
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Renfro, Michael <ren...@tntech.edu>
> Sent: 31 January 2020 22:08
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] Longer queuing times for larger jobs
>
> Slurm 19.05 now, though all these settings were in effect on 17.02
> until quite recently. If I get some detail wrong below, I hope someone
> will correct me. But this is our current working state. We've been able
> to schedule 10-20k jobs per month since late 2017, and we successfully
> scheduled 320k jobs over December and January (largely due to one user
> using some form of automated submission for very short jobs).
>
> Basic scheduler setup:
>
> As I'd said previously, we prioritize on fairshare almost exclusively.
> Most of our jobs (molecular dynamics, CFD) end up in a single batch
> partition, since GPU and big-memory jobs have other partitions.
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=14-0
> PriorityWeightFairshare=100000
> PriorityWeightAge=1000
> PriorityWeightPartition=10000
> PriorityWeightJobSize=1000
> PriorityMaxAge=1-0
>
> TRES limits:
>
> We've limited users to 1000 CPU-days with:
>
>   sacctmgr modify user someuser set grptresrunmin=cpu=1440000
>
> -- there might be a way of doing this at a higher accounting level, but
> it works as is.
>
> We also force QoS=gpu in each GPU partition's definition in slurm.conf,
> and set MaxJobsPerUser equal to our total GPU count. That helps prevent
> users from queue-stuffing the GPUs even if they stay well below the
> 1000 CPU-day TRES limit above.
>
> Backfill:
>
> SchedulerType=sched/backfill
>
> SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200
>
> Can't remember where I found the backfill guidance, but:
>
> - bf_window is set to our maximum job length (30 days) and
>   bf_resolution is set to 1.5 days. Most of our users' jobs are well
>   over 1 day.
> - We have had users who didn't use job arrays and submitted a ton of
>   small jobs at once, thus bf_max_job_user gives the scheduler a chance
>   to start up to 80 jobs per user each cycle. This also prompted us to
>   increase default_queue_depth, so the backfill scheduler would examine
>   more jobs each cycle.
> - bf_continue should let the backfill scheduler continue where it left
>   off if it gets interrupted, instead of having to start from scratch
>   each time.
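For anyone following along: the backfill parameters above are in
minutes, and the grptresrunmin figure is in CPU-minutes, so the
arithmetic behind those settings works out as follows (annotations are
mine, not part of the quoted configuration):

```
bf_window     = 43200 minutes  = 30 days    (their maximum job length)
bf_resolution =  2160 minutes  = 1.5 days
grptresrunmin = cpu=1440000    = 1000 CPU-days x 1440 minutes/day
```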
>
> I can guarantee you that our backfilling was sub-par until we tuned
> these parameters (or at least a few users could find a way to submit so
> many jobs that the backfill couldn't keep up, even when we had idle
> resources for their very short jobs).
>
> On Jan 31, 2020, at 3:01 PM, David Baker <d.j.ba...@soton.ac.uk> wrote:
>
> > External Email Warning
> > This email originated from outside the university. Please use caution
> > when opening attachments, clicking links, or responding to requests.
> >
> > Hello,
> >
> > Thank you for your detailed reply. That's all very useful. I managed
> > to mistype our cluster size, since there are actually 450 standard
> > compute, 40 core, compute nodes. What you say is interesting, and so
> > it concerns me that things are so bad at the moment.
> >
> > I wondered if you could please give me some more details of how you
> > use TRES to throttle user activity. We have applied some limits to
> > throttle users, however perhaps not enough or not well enough. So the
> > details of what you do would be really appreciated, please.
> >
> > In addition, we do use backfill, however we rarely see nodes being
> > freed up in the cluster to make way for high priority work, which
> > again concerns me. If you could please share your backfill
> > configuration then that would be appreciated.
> >
> > Finally, which version of Slurm are you running? We are using an
> > early release of v18.
> >
> > Best regards,
> > David
> >
> > From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Renfro, Michael <ren...@tntech.edu>
> > Sent: 31 January 2020 17:23:05
> > To: Slurm User Community List <slurm-users@lists.schedmd.com>
> > Subject: Re: [slurm-users] Longer queuing times for larger jobs
> >
> > I missed reading what size your cluster was at first, but found it on
> > a second read.
> > Our cluster and typical maximum job size scale about the same way,
> > though (our users' typical job size is anywhere from a few cores up
> > to 10% of our core count).
> >
> > There are several recommendations to separate your priority weights
> > by an order of magnitude or so. Our weights are dominated by
> > fairshare, and we effectively ignore all other factors.
> >
> > We also put TRES limits on by default, so that users can't
> > queue-stuff beyond a certain limit (any jobs totaling under around 1
> > cluster-day can be in a running or queued state, and anything past
> > that is ignored until their running jobs burn off some of their
> > time). This allows other users' jobs to have a chance to run if
> > resources are available, even if they were submitted well after the
> > heavy users' blocked jobs.
> >
> > We also make extensive use of the backfill scheduler to run small,
> > short jobs earlier than their queue time might allow, if and only if
> > they don't delay other jobs. If a particularly large job is about to
> > run, we can see the nodes gradually empty out, which opens up lots of
> > capacity for very short jobs.
> >
> > Overall, our average wait times since September 2017 haven't exceeded
> > 90 hours for any job size, and I'm pretty sure a *lot* of that wait
> > is due to a few heavy users submitting large numbers of jobs far
> > beyond the TRES limit. Even our jobs of 5-10% cluster size have
> > average wait times of 60 hours or less (and we've managed under 48
> > hours for those size jobs for all but 2 months of that period), but
> > those larger jobs tend to be run by our lighter users, and they get a
> > major improvement to their queue time due to being far below their
> > fairshare target.
> >
> > We've been running at >50% capacity since May 2018, >60% capacity
> > since December 2018, and >80% capacity since February 2019. So our
> > wait times aren't due to having a ton of spare capacity for extended
> > periods of time.
> > Not sure how much of that will help immediately, but it may give you
> > some ideas.
> >
> > On Jan 31, 2020, at 10:14 AM, David Baker <d.j.ba...@soton.ac.uk> wrote:
> >
> > > Hello,
> > >
> > > Thank you for your reply. In answer to Mike's questions...
> > >
> > > Our serial partition nodes are partially shared by the high memory
> > > partition. That is, the partitions overlap partially -- shared
> > > nodes move one way or another depending upon demand. Jobs
> > > requesting up to and including 20 cores are routed to the serial
> > > queue. The serial nodes are shared resources. In other words, jobs
> > > from different users can share the nodes. The maximum time for
> > > serial jobs is 60 hours.
> > >
> > > Over time there hasn't been any particular change in the time that
> > > users are requesting. Likewise, I'm convinced that the overall job
> > > size spread is the same over time. What has changed is the increase
> > > in the number of smaller jobs. That is, one node jobs that are
> > > exclusive (can't be routed to the serial queue) or that require
> > > more than 20 cores, and also jobs requesting up to 10/15 nodes
> > > (let's say). The user base has increased dramatically over the last
> > > 6 months or so.
> > >
> > > This over-population is leading to the delay in scheduling the
> > > larger jobs. Given the size of the cluster, we may need to make
> > > decisions regarding which types of jobs we allow to "dominate" the
> > > system. The larger jobs at the expense of the small fry, for
> > > example; however, that is a difficult decision that means that
> > > someone has got to wait longer for results.
> > > Best regards,
> > > David
> > >
> > > From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Renfro, Michael <ren...@tntech.edu>
> > > Sent: 31 January 2020 13:27
> > > To: Slurm User Community List <slurm-users@lists.schedmd.com>
> > > Subject: Re: [slurm-users] Longer queuing times for larger jobs
> > >
> > > Greetings, fellow general university resource administrator.
> > >
> > > A couple of things come to mind from my experience:
> > >
> > > 1) Does your serial partition share nodes with the other non-serial
> > >    partitions?
> > >
> > > 2) What's your maximum job time allowed, for serial (if the
> > >    previous answer was "yes") and non-serial partitions? Are your
> > >    users submitting particularly longer jobs compared to earlier?
> > >
> > > 3) Are you using the backfill scheduler at all?
> > >
> > > --
> > > Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
> > > 931 372-3601 / Tennessee Tech University
> > >
> > >> On Jan 31, 2020, at 6:23 AM, David Baker <d.j.ba...@soton.ac.uk> wrote:
> > >>
> > >> Hello,
> > >>
> > >> Our Slurm cluster is relatively small. We have 350 standard
> > >> compute nodes, each with 40 cores. The largest job that users can
> > >> run on the partition is one requesting 32 nodes. Our cluster is a
> > >> general university research resource, and so there are many
> > >> different sizes of jobs, ranging from single core jobs, which get
> > >> routed to a serial partition via job_submit.lua, through to jobs
> > >> requesting 32 nodes. When we first started the service, 32 node
> > >> jobs were typically taking in the region of 2 days to schedule --
> > >> recently, queuing times have started to get out of hand. Our setup
> > >> is essentially...
> > >> PriorityFavorSmall=NO
> > >> FairShareDampeningFactor=5
> > >> PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
> > >> PriorityType=priority/multifactor
> > >> PriorityDecayHalfLife=7-0
> > >>
> > >> PriorityWeightAge=400000
> > >> PriorityWeightPartition=1000
> > >> PriorityWeightJobSize=500000
> > >> PriorityWeightQOS=1000000
> > >> PriorityMaxAge=7-0
> > >>
> > >> To try to reduce the queuing times for our bigger jobs, should we
> > >> potentially increase the PriorityWeightJobSize factor in the first
> > >> instance to bump up the priority of such jobs? Or should we
> > >> potentially define a set of QOSs which we assign to jobs in our
> > >> job_submit.lua depending on the size of the job? In other words,
> > >> let's say there is a "large" QOS that gives the largest jobs a
> > >> higher priority, and also limits how many of those jobs a single
> > >> user can submit.
> > >>
> > >> Your advice would be appreciated, please. At the moment these
> > >> large jobs are not accruing a sufficiently high priority to rise
> > >> above the other jobs in the cluster.
> > >>
> > >> Best regards,
> > >> David

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin
Email loris.benn...@fu-berlin.de
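For reference, the "large" QOS idea in David's original message above
could be sketched roughly as follows. The QOS name, priority value,
node threshold, and per-user limit are all illustrative assumptions,
not tested settings. First, a QOS created with sacctmgr, e.g.

```
# Create a QOS for large jobs: higher priority, limited submissions
# per user (names and values are illustrative)
sacctmgr add qos large
sacctmgr modify qos large set Priority=100000 MaxSubmitJobsPerUser=4
```

and then a fragment for job_submit.lua, following the standard
job_submit/lua plugin interface:

```lua
-- job_submit.lua sketch (threshold and QOS name are assumptions)
function slurm_job_submit(job_desc, part_list, submit_uid)
   -- min_nodes is a very large sentinel value when the user did not
   -- specify a node count, so guard with an upper bound as well
   if job_desc.min_nodes ~= nil
      and job_desc.min_nodes >= 20
      and job_desc.min_nodes <= 1024 then
      job_desc.qos = "large"
   end
   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end
```

Whether this beats simply raising PriorityWeightJobSize depends on how
much control you want over per-user limits for the large jobs; the QOS
route gives you MaxSubmitJobsPerUser and friends for free.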