Hi Reed,

Reed Dier <reed.d...@focusvq.com> writes:

>  On Jun 27, 2023, at 1:10 AM, Loris Bennett <loris.benn...@fu-berlin.de> 
> wrote:
>
>  Hi Reed,
>
>  Reed Dier <reed.d...@focusvq.com> writes:
>
>  Is this an issue with the essentially FIFO nature of the priority
>  scheduling at the moment, with all of the other factors disabled?  Or,
>  since my queue is fairly deep, is it due to bf_max_job_test being at the
>  default of 100, so that the scheduler can’t look deep enough into the
>  queue to find a job that will fit into the unoccupied resources?
>
>  It could be that bf_max_job_test is too low.  On our system some users
>  think it is a good idea to submit lots of jobs with identical resource
>  requirements by writing a loop around sbatch.  Such jobs exhaust the
>  bf_max_job_test limit very quickly.  We therefore increased the limit to
>  1000 and try to persuade users to use job arrays instead of home-grown
>  loops.  This seems to work OK[1].
>
>  Cheers,
>
>  Loris
>
>  -- 
>  Dr. Loris Bennett (Herr/Mr)
>  ZEDAT, Freie Universität Berlin
>
> Thanks Loris,
> I think this will be the next knob to turn, and your experience gives me a
> bit more confidence in that, as we too have many such identical jobs.
>
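
(As an aside, the kind of array we nudge people towards is sketched below;
'job.sh' and the range 1-100 are just placeholders for whatever the loop
happens to be doing:

    # home-grown loop: 100 separate submissions for the scheduler to track
    for i in $(seq 1 100); do
        sbatch job.sh "$i"
    done

    # job array: a single submission covering the same 100 tasks
    sbatch --array=1-100 job.sh

Inside job.sh each array task can then pick up its own index from
$SLURM_ARRAY_TASK_ID.)
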
>  On Jun 26, 2023, at 9:10 PM, Brian Andrus <toomuc...@gmail.com> wrote:
>
>  Reed,
>
>  You may want to look at the timelimit aspect of the job(s).
>
>  For one to 'squeeze in', it needs to be able to finish before the resources 
> in use are expected to become available.
>
>  Consider:
>  Job A is running on 2 nodes of a 3 node cluster. It will finish in 1 hour.
>  Pending job B needs 2 nodes for 2 hours, but only 1 node is free, so it
>  waits.
>  Pending job C (with a lower priority) needs 1 node for 2 hours. Hmm, well it 
> won't finish before the time job B is expected to start, so it waits.
>  Pending job D (with even lower priority) needs 1 node for 30 minutes. That 
> can squeeze in before the additional node for Job B is expected to be
>  available, so it runs on the idle node.
>
>  Brian Andrus
>
> Thanks Brian,
>
> Our layout is a bit less exciting, in that none of these jobs uses more than
> one node, so blocking out whole nodes for job-to-node Tetris isn’t really at
> play here.
> The timing however is something I may turn an eye towards.
> Most jobs have a “sanity” time limit applied: it is not so much an expected
> run time as an “if it goes this long, something obviously went awry and we
> shouldn’t keep holding on to resources” limit.
> So it’s a bit hard to quantify the timing portion, and I haven’t yet looked
> at Slurm’s estimates of when it thinks the next job will start, etc.
>
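
On the question of Slurm's guesses: 'squeue --start' shows the scheduler's
estimated start times for pending jobs, e.g.

    squeue --start --state=PENDING

Jobs for which no start time has been calculated yet show up as "N/A" there,
which can itself be a hint that backfill never got far enough down the queue
to evaluate them.
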
> The fairly simplistic example at play here is that there are nodes which are
> ~50-60% loaded for CPU and memory.
> The next job up is a “whale” job that wants a ton of resources, CPU and/or
> memory, but further down the queue there is a job with 2 CPUs and 2 GB of
> memory that could easily slot into the unused resources.
>
> So my thinking was that the job_test list may be too short for the scheduler
> to actually get that far down the queue and see that it could shove that job
> into some of the holes.

You might also want to look at increasing bf_window to the maximum time
limit, as suggested in 'man slurm.conf'.  If backfill is not looking far
enough into the future to know whether starting a job early will
negatively impact a 'whale', then that 'whale' could potentially wait
indefinitely.  This is what happened on our system when we had a maximum
runtime of 14 days but had left bf_window at its default of one day.  With
both set to 14 days, the problem was solved.
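
To make that concrete, the backfill-related lines in slurm.conf would look
roughly like this (bf_window is given in minutes, so 20160 is 14 days; the
bf_resolution value is only an illustration, though 'man slurm.conf' does
advise raising it when bf_window gets large):

    SchedulerType=sched/backfill
    SchedulerParameters=bf_max_job_test=1000,bf_window=20160,bf_resolution=600

You would of course set bf_window to match whatever your own maximum time
limit is.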

Cheers,

Loris

> I’ll report back any findings after testing Loris’s suggestions.
>
> Appreciate everyone’s help and suggestions,
> Reed
>
-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin
