Hi Reed,

Reed Dier <reed.d...@focusvq.com> writes:
> On Jun 27, 2023, at 1:10 AM, Loris Bennett <loris.benn...@fu-berlin.de> wrote:
>
>> Hi Reed,
>>
>> Reed Dier <reed.d...@focusvq.com> writes:
>>
>>> Is this an issue with the relative FIFO nature of the priority
>>> scheduling currently with all of the other factors disabled, or,
>>> since my queue is fairly deep, is this due to bf_max_job_test being
>>> the default 100, such that it can't look deep enough into the queue
>>> to find a job that will fit into what is unoccupied?
>>
>> It could be that bf_max_job_test is too low.  On our system some
>> users think it is a good idea to submit lots of jobs with identical
>> resource requirements by writing a loop around sbatch.  Such jobs
>> will exhaust the bf_max_job_test very quickly.  Thus we increased the
>> limit to 1000 and try to persuade users to use job arrays instead of
>> home-grown loops.  This seems to work OK[1].
>>
>> Cheers,
>>
>> Loris
>>
>> --
>> Dr. Loris Bennett (Herr/Mr)
>> ZEDAT, Freie Universität Berlin
>
> Thanks Loris,
>
> I think this will be the next knob to turn, and it gives a bit more
> confidence to that, as we too have many such identical jobs.
>
> On Jun 26, 2023, at 9:10 PM, Brian Andrus <toomuc...@gmail.com> wrote:
>
>> Reed,
>>
>> You may want to look at the timelimit aspect of the job(s).
>>
>> For one to 'squeeze in', it needs to be able to finish before the
>> resources in use are expected to become available.
>>
>> Consider:
>> Job A is running on 2 nodes of a 3-node cluster.  It will finish in
>> 1 hour.
>> Pending job B will run for 2 hours and needs 2 nodes, but only 1 is
>> free, so it waits.
>> Pending job C (with a lower priority) needs 1 node for 2 hours.  Hmm,
>> well, it won't finish before the time job B is expected to start, so
>> it waits.
>> Pending job D (with an even lower priority) needs 1 node for 30
>> minutes.  That can squeeze in before the additional node for job B is
>> expected to be available, so it runs on the idle node.
>>
>> Brian Andrus
>
> Thanks Brian,
>
> Our layout is a bit less exciting, in that none of these are >1 node
> per job, so the blocking out of nodes for job:node Tetris isn't really
> at play here.  The timing, however, is something I may turn an eye
> towards.
>
> Most jobs have a "sanity" time limit applied, in that it is not so
> much an expected time limit, but rather an "if it goes this long,
> something obviously went awry and we shouldn't keep holding on to
> resources" limit.  So it's a bit hard to quantify the timing portion,
> but I haven't looked into Slurm's guesses of when it thinks the next
> task will start, etc.
>
> The pretty simplistic example at play here is that there are nodes
> that are ~50-60% loaded for CPU and memory.  The next job up is a
> "whale" job that wants a ton of resources, CPU and/or memory, but down
> the line there is a job with 2 CPUs and 2 GB of memory that could
> easily slot into the unused resources.
>
> So my thinking was that the job_test list may be too short to actually
> get that far down the queue to see that it could shove that job into
> some holes.

You might also want to look at increasing bf_window to the maximum time
limit, as suggested in 'man slurm.conf'.  If backfill is not looking far
enough into the future to know whether starting a job early will
negatively impact a 'whale', then that 'whale' could potentially wait
indefinitely.  This is what happened on our system when we had a maximum
runtime of 14 days but the 1-day default for bf_window.  With both set
to 14 days the problem was solved.
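For concreteness, both knobs live in SchedulerParameters in slurm.conf.
The values below are only an illustration for a 14-day maximum runtime
(bf_window is given in minutes, so 14 days is 20160) plus a deeper test
list, and would need to be merged with whatever your site already sets:

    # illustrative values only -- merge with your existing settings
    SchedulerType=sched/backfill
    SchedulerParameters=bf_window=20160,bf_max_job_test=1000

If I remember correctly, an 'scontrol reconfigure' is enough for the
scheduler to pick up the change.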
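Regarding Slurm's guesses about start times: 'squeue --start' shows the
expected start time and the pending reason for each queued job.  The
output format below is just one example of the fields you might want:

    squeue --start --states=PENDING -o "%.18i %.9P %.8u %.20S %.20r"

If the 'whale' never gets an expected start time at all, that can be a
hint that backfill is not planning far enough ahead for it.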
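And as an aside on the many identical jobs: the switch from a
home-grown loop to a job array is usually trivial.  Roughly (job.sh and
the 1-100 range are of course just placeholders):

    # instead of
    for i in $(seq 1 100); do sbatch job.sh "$i"; done

    # submit once as an array; job.sh reads $SLURM_ARRAY_TASK_ID
    # instead of "$1"
    sbatch --array=1-100 job.sh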
Cheers,

Loris

> I'll report back any findings after testing Loris's suggestions.
>
> Appreciate everyone's help and suggestions,
> Reed

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin