Magnus Jonsson <mag...@hpc2n.umu.se> writes:

> Jobs are not backfilled due to the fact that backfill does not finish the
> complete backlog of jobs in the queue before it's interrupted and starts all
> over again. We sometimes have lots of jobs in the queue of various sizes and
> users and even with idle nodes short job will not start because of this.

We have what seems to be a similar type of load, and have at times
experienced the same problem.

There are some parameters that can be used to tune the backfiller.

We have had good results with setting bf_max_job_user to a small value
(between 5 and 10), and bf_resolution to a large value (around 3600).

bf_max_job_user is similar to Maui's MAXIJOB limit; the backfiller will
only try this many jobs for each user.  This is especially useful if
some users have many identical or nearly identical jobs in the queue.

bf_resolution is the time resolution (in seconds) of the time slots used
for estimating when a job can start.  The default, 60 seconds, was way
too low for us.
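
For reference, both options go into the comma-separated
SchedulerParameters list in slurm.conf.  A minimal sketch, assuming the
backfill scheduler is in use; the values are illustrative, picked from
the ranges mentioned above, and any SchedulerParameters you already
have must be kept in the same list:

```
# slurm.conf (sketch; values are illustrative, not a recommendation)
SchedulerType=sched/backfill
# Try at most 10 jobs per user in each backfill cycle, and plan
# start times in 3600-second (one hour) slots.
SchedulerParameters=bf_max_job_user=10,bf_resolution=3600
```

After editing the file, "scontrol reconfigure" should normally be
enough for slurmctld to pick up the change.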

> I have made a patch for backfill with a configuration option (bf_continue) to
> let backfill continue from the last JobID of the last cycle.
>
> This will make backfill look at the whole queue eventually.

Interesting.  We will take a look at this.

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
