Hello David,

David Baker <d.j.ba...@soton.ac.uk> writes:

> Hello,
>
> I've taken a very good look at our cluster, however as yet I have not
> made any significant changes. The one change that I did make was to
> increase the job size weight (PriorityWeightJobSize). That's now our
> dominant parameter and it does ensure that our largest jobs (> 20 nodes)
> are making it to the top of the sprio listing, which is what we want to see.
>
> These large jobs aren't making any progress despite the priority
> lift. I additionally decreased the nice value of the job that sparked
> this discussion. That is, looking at sprio, there is a 32 node job
> with a very high priority...
>
> JOBID   PARTITION  USER     PRIORITY  AGE     FAIRSHARE  JOBSIZE  PARTITION  QOS  NICE
> 280919  batch      mep1c10  1275481   400000  59827      415655   0          0    -400000
>
> That job has been sitting in the queue for well over a week and it is
> disconcerting that we never see nodes being held idle in order to
> service these large jobs. Nodes do become idle, but they are then
> immediately scooped up by jobs started via backfill. Looking at the
> slurmctld logs I see that the vast majority of jobs are being started
> via backfill -- including, for example, a 24 node job. I see very few
> jobs allocated by the main scheduler. That is, messages like
> "sched: Allocate JobId=296915" are few and far between, and I never
> see any of the large jobs being allocated in the batch queue.
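>
> (For example, a rough way to compare the two is to count the messages --
> the log path here is site-specific and the exact backfill message wording
> may differ between Slurm versions:
>
>   grep -c 'sched: Allocate'    /var/log/slurmctld.log
>   grep -c 'backfill: Started'  /var/log/slurmctld.log
>
> and the backfill count dwarfs the other.)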
>
> Surely this is not correct; does anyone have any advice on what to
> check, please?

Have you looked at what 'sprio' says?  I usually want to see the list
sorted by priority and so call it like this:

  sprio -l -S "%Y"

If you run

 scontrol show job <jobid>

is the entry 'NodeList' ever anything other than '(null)'?
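
It can also be worth asking Slurm for its estimated start time for the
job, e.g.

  squeue --start -j <jobid>

If that never shows anything other than "N/A", the scheduler is
presumably never getting as far as planning resources for the job.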

Cheers,

Loris

> Best regards,
> David
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
> Killian Murphy <killian.mur...@york.ac.uk>
> Sent: 04 February 2020 10:48
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] Longer queuing times for larger jobs 
>  
> Hi David. 
>
> I'd love to hear back about the changes that you make and how they affect the 
> performance of your scheduler.
>
> Any chance you could let us know how things go?
>
> Killian
>
> On Tue, 4 Feb 2020 at 10:43, David Baker <d.j.ba...@soton.ac.uk> wrote:
>
>  Hello,
>
>  Thank you very much again for your comments and the details of your slurm 
> configuration. All the information is really useful. We are working on our 
> cluster right now and making some appropriate changes.
>  We'll see how we get on over the next 24 hours or so.
>
>  Best regards,
>  David
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>  From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
> Renfro, Michael <ren...@tntech.edu>
>  Sent: 31 January 2020 22:08
>  To: Slurm User Community List <slurm-users@lists.schedmd.com>
>  Subject: Re: [slurm-users] Longer queuing times for larger jobs 
>   
>  Slurm 19.05 now, though all these settings were in effect on 17.02 until 
> quite recently. If I get some detail wrong below, I hope someone will correct 
> me. But this is our current working state. We’ve been able to
>  schedule 10-20k jobs per month since late 2017, and we successfully 
> scheduled 320k jobs over December and January (largely due to one user using 
> some form of automated submission for very short jobs).
>
>  Basic scheduler setup:
>
>  As I’d said previously, we prioritize on fairshare almost exclusively. Most 
> of our jobs (molecular dynamics, CFD) end up in a single batch partition, 
> since GPU and big-memory jobs have other partitions.
>
>  SelectType=select/cons_res
>  SelectTypeParameters=CR_Core_Memory
>  PriorityType=priority/multifactor
>  PriorityDecayHalfLife=14-0
>  PriorityWeightFairshare=100000
>  PriorityWeightAge=1000
>  PriorityWeightPartition=10000
>  PriorityWeightJobSize=1000
>  PriorityMaxAge=1-0
>
>  TRES limits:
>
>  We’ve limited users to 1000 CPU-days with: sacctmgr modify user someuser set 
> grptresrunmin=cpu=1440000 — there might be a way of doing this at a higher 
> accounting level, but it works as is.
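>
>  For what it's worth, the equivalent limit could presumably also be applied
>  at the account level rather than per user -- 'someaccount' below is just a
>  placeholder:
>
>    sacctmgr modify account name=someaccount set grptresrunmin=cpu=1440000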
>
>  We also force QoS=gpu in each GPU partition’s definition in slurm.conf, and 
> set MaxJobsPerUser equal to our total GPU count. That helps prevent users 
> from queue-stuffing the GPUs even if they stay well below
>  the 1000 CPU-day TRES limit above.
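>
>  As a rough sketch of that arrangement (the partition name, node names, QOS
>  name and the per-user cap below are only placeholders):
>
>    # slurm.conf: force every job in this partition onto the "gpu" QOS
>    PartitionName=gpu Nodes=gpunode[01-04] QOS=gpu
>
>    # sacctmgr: create the QOS and cap concurrently running jobs per user
>    sacctmgr add qos gpu
>    sacctmgr modify qos gpu set MaxJobsPerUser=8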
>
>  Backfill:
>
>    SchedulerType=sched/backfill
>    
> SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200
>
>  Can’t remember where I found the backfill guidance, but:
>
>  - bf_window is set to our maximum job length (30 days) and bf_resolution is 
> set to 1.5 days. Most of our users’ jobs are well over 1 day.
>  - We have had users who didn’t use job arrays, and submitted a ton of small 
> jobs at once, thus bf_max_job_user gives the scheduler a chance to start up 
> to 80 jobs per user each cycle. This also prompted us to
>  increase default_queue_depth, so the backfill scheduler would examine more 
> jobs each cycle.
>  - bf_continue should let the backfill scheduler continue where it left off 
> if it gets interrupted, instead of having to start from scratch each time.
>
>  I can guarantee you that our backfilling was sub-par until we tuned these 
> parameters (or at least a few users could find a way to submit so many jobs 
> that the backfill couldn’t keep up, even when we had idle
>  resources for their very short jobs).
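>
>  Incidentally, 'sdiag' is a quick way to check whether the backfill
>  scheduler is keeping up -- it reports the backfill cycle times, the queue
>  depth reached each cycle and the number of jobs it has started, e.g.
>
>    sdiag | grep -A 15 'Backfilling stats'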
>
>  > On Jan 31, 2020, at 3:01 PM, David Baker <d.j.ba...@soton.ac.uk> wrote:
>  > 
>  > Hello,
>  > 
>  > Thank you for your detailed reply. That's all very useful. I managed to
>  > mistype our cluster size: there are actually 450 standard compute nodes,
>  > each with 40 cores. What you say is interesting, and so it concerns me
>  > that things are so bad at the moment.
>  > 
>  > I wondered if you could please give me some more details of how you use
>  > TRES to throttle user activity. We have applied some limits to throttle
>  > users, but perhaps not enough or not well enough, so the details of what
>  > you do would be really appreciated.
>  > 
>  > In addition, we do use backfill; however, we rarely see nodes being freed
>  > up in the cluster to make way for high-priority work, which again
>  > concerns me. If you could please share your backfill configuration, that
>  > would be much appreciated.
>  > 
>  > Finally, which version of Slurm are you running? We are using an early 
> release of v18.
>  > 
>  > Best regards,
>  > David
>  > 
>  > From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
> Renfro, Michael <ren...@tntech.edu>
>  > Sent: 31 January 2020 17:23:05
>  > To: Slurm User Community List <slurm-users@lists.schedmd.com>
>  > Subject: Re: [slurm-users] Longer queuing times for larger jobs
>  >  
>  > I missed reading what size your cluster was at first, but found it on a 
> second read. Our cluster and typical maximum job size scales about the same 
> way, though (our users’ typical job size is anywhere from a
>  few cores up to 10% of our core count).
>  > 
>  > There are several recommendations to separate your priority weights by an 
> order of magnitude or so. Our weights are dominated by fairshare, and we 
> effectively ignore all other factors.
>  > 
>  > We also put TRES limits on by default, so that users can’t queue-stuff 
> beyond a certain limit (any jobs totaling under around 1 cluster-day can be 
> in a running or queued state, and anything past that is
>  ignored until their running jobs burn off some of their time). This allows 
> other users’ jobs to have a chance to run if resources are available, even if 
> they were submitted well after the heavy users’ blocked jobs.
>  > 
>  > We also make extensive use of the backfill scheduler to run small, short 
> jobs earlier than their queue time might allow, if and only if they don’t 
> delay other jobs. If a particularly large job is about to run, we
>  can see the nodes gradually empty out, which opens up lots of capacity for 
> very short jobs.
>  > 
>  > Overall, our average wait times since September 2017 haven’t exceeded 90 
> hours for any job size, and I’m pretty sure a *lot* of that wait is due to a 
> few heavy users submitting large numbers of jobs far
>  beyond the TRES limit. Even our jobs of 5-10% cluster size have average 
> start times of 60 hours or less (and we've managed under 48 hours for those 
> size jobs for all but 2 months of that period), but those
>  larger jobs tend to be run by our lighter users, and they get a major 
> improvement to their queue time due to being far below their fairshare target.
>  > 
>  > We’ve been running at >50% capacity since May 2018, and >60% capacity 
> since December 2018, and >80% capacity since February 2019. So our wait times 
> aren’t due to having a ton of spare capacity for
>  extended periods of time.
>  > 
>  > Not sure how much of that will help immediately, but it may give you some 
> ideas.
>  > 
>  > > On Jan 31, 2020, at 10:14 AM, David Baker <d.j.ba...@soton.ac.uk> wrote:
>  > > 
>  > > Hello,
>  > > 
>  > > Thank you for your reply. In answer to Mike's questions...
>  > > 
>  > > Our serial partition nodes are partially shared by the high memory 
> partition. That is, the partitions overlap partially -- shared nodes move one 
> way or another depending upon demand. Jobs requesting up
>  to and including 20 cores are routed to the serial queue. The serial nodes 
> are shared resources. In other words, jobs from different users can share the 
> nodes. The maximum time for serial jobs is 60 hours. 
>  > > 
>  > > Over time there hasn't been any particular change in the run times that
>  > > users are requesting. Likewise, I'm convinced that the overall job size
>  > > spread is the same over time. What has changed is the increase in the
>  > > number of smaller jobs. That is, one node jobs that are exclusive (and
>  > > so can't be routed to the serial queue) or that require more than 20
>  > > cores, and also jobs requesting up to 10/15 nodes (let's say). The user
>  > > base has increased dramatically over the last 6 months or so.
>  > > 
>  > > This overpopulation is leading to the delay in scheduling the larger
>  > > jobs. Given the size of the cluster, we may need to make decisions
>  > > regarding which types of jobs we allow to "dominate" the system: the
>  > > larger jobs at the expense of the small fry, for example. However, that
>  > > is a difficult decision which means that someone has to wait longer for
>  > > their results.
>  > > 
>  > > Best regards,
>  > > David
>  > > From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of 
> Renfro, Michael <ren...@tntech.edu>
>  > > Sent: 31 January 2020 13:27
>  > > To: Slurm User Community List <slurm-users@lists.schedmd.com>
>  > > Subject: Re: [slurm-users] Longer queuing times for larger jobs
>  > >  
>  > > Greetings, fellow general university resource administrator.
>  > > 
>  > > Couple things come to mind from my experience:
>  > > 
>  > > 1) does your serial partition share nodes with the other non-serial 
> partitions?
>  > > 
>  > > 2) what’s your maximum job time allowed, for serial (if the previous 
> answer was “yes”) and non-serial partitions? Are your users submitting 
> particularly longer jobs compared to earlier?
>  > > 
>  > > 3) are you using the backfill scheduler at all?
>  > > 
>  > > --
>  > > Mike Renfro, PhD  / HPC Systems Administrator, Information Technology 
> Services
>  > > 931 372-3601      / Tennessee Tech University
>  > > 
>  > >> On Jan 31, 2020, at 6:23 AM, David Baker <d.j.ba...@soton.ac.uk> wrote:
>  > >> 
>  > >> Hello,
>  > >> 
>  > >> Our SLURM cluster is relatively small. We have 350 standard compute
>  > >> nodes, each with 40 cores. The largest job that users can run on the
>  > >> partition is one requesting 32 nodes. Our cluster is a general
>  > >> university research resource and so there are many different sizes of
>  > >> jobs, ranging from single core jobs, which get routed to a serial
>  > >> partition via job_submit.lua, through to jobs requesting 32 nodes.
>  > >> When we first started the service, 32 node jobs were typically taking
>  > >> in the region of 2 days to schedule -- recently queuing times have
>  > >> started to get out of hand. Our setup is essentially...
>  > >> 
>  > >> PriorityFavorSmall=NO
>  > >> FairShareDampeningFactor=5
>  > >> PriorityFlags=ACCRUE_ALWAYS,FAIR_TREE
>  > >> PriorityType=priority/multifactor
>  > >> PriorityDecayHalfLife=7-0
>  > >> 
>  > >> PriorityWeightAge=400000
>  > >> PriorityWeightPartition=1000
>  > >> PriorityWeightJobSize=500000
>  > >> PriorityWeightQOS=1000000
>  > >> PriorityMaxAge=7-0
>  > >> 
>  > >> To try to reduce the queuing times for our bigger jobs, should we
>  > >> potentially increase the PriorityWeightJobSize factor in the first
>  > >> instance to bump up the priority of such jobs? Or should we define a
>  > >> set of QOSs which we assign to jobs in our job_submit.lua depending on
>  > >> the size of the job? In other words, let's say there is a "large" QOS
>  > >> that gives the largest jobs a higher priority, and also limits how many
>  > >> of those jobs a single user can submit.
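>  > >> 
>  > >> A rough sketch of what that rule might look like in job_submit.lua (the
>  > >> "large" QOS name and the 20-node cut-off are only placeholders, and the
>  > >> QOS itself would need to be created and given a priority via sacctmgr):
>  > >> 
>  > >>   function slurm_job_submit(job_desc, part_list, submit_uid)
>  > >>      -- put sufficiently large jobs onto a dedicated, higher-priority QOS;
>  > >>      -- guard against min_nodes being unset (NO_VAL) for core-only requests
>  > >>      if job_desc.min_nodes ~= nil and job_desc.min_nodes ~= slurm.NO_VAL
>  > >>         and job_desc.min_nodes >= 20 then
>  > >>         job_desc.qos = "large"
>  > >>      end
>  > >>      return slurm.SUCCESS
>  > >>   end
>  > >> 
>  > >>   -- the lua plugin also expects a slurm_job_modify function
>  > >>   function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
>  > >>      return slurm.SUCCESS
>  > >>   end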
>  > >> 
>  > >> Your advice would be appreciated, please. At the moment these large 
> jobs are not accruing a sufficiently high priority to rise above the other 
> jobs in the cluster.
>  > >> 
>  > >> Best regards,
>  > >> David 
>  > 
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.benn...@fu-berlin.de
