Thanks Jarek.  Let me respond to your comments.  Then I'll follow up with
another message trying to focus the discussion on my question.


> I think it should be controllable at the moment when you start backfill.


Current state is, when you create a backfill you set max_active_runs for
it.  And this is completely independent of the DAG level max_active_runs.
And you can change Backfill.max_active_runs in flight -- this will just
control how many new dag runs can be started.  So e.g. if you reduce it
below the number of currently running dag runs, the running ones are left
to complete.
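
To make that concrete, here is a minimal sketch of the arithmetic. It is
illustrative only and not the actual scheduler code; the function names and
arguments are made up for this example.

```python
def new_runs_allowed(max_active_runs: int, num_running: int) -> int:
    """How many additional dag runs may be started right now.

    Never negative: if the limit was lowered below the number of runs
    already in flight, nothing is killed, we just start no new ones.
    """
    return max(0, max_active_runs - num_running)


def combined_ceiling(dag_max_active_runs: int, backfill_max_active_runs: int) -> int:
    """With independent limits, the overall ceiling for the dag is the sum."""
    return dag_max_active_runs + backfill_max_active_runs


if __name__ == "__main__":
    # Backfill limit lowered to 2 while 5 runs are still active.
    print(new_runs_allowed(max_active_runs=2, num_running=5))
    # DAG limit 2 plus backfill limit 1.
    print(combined_ceiling(2, 1))
```

The `max(0, ...)` is the whole point: lowering the limit only throttles new
starts, and the running ones drain on their own.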

> IMHO default behaviour should be that there should be very little
> concurrency reserved for backfills so that you could run backfills without
> impacting regular runs - say "max 10 backfill task instances" and "max 3
> backfill runs".


Current state is user gets to choose.  I think I put a default of 10
concurrent runs. But it's up to the user when they create the backfill.  I
do *not* have any concurrency control on concurrent TIs for backfill.
That's controlled through dag settings.

> But then - you should be able to address special case when you want to
> prioritise the backfills - and in certain cases even starve the regular
> runs because you REALLY need to backfill old data asap  - and there you
> should be able to override the max for specific backfill instance.


Having a separate max_active_runs for Backfill accomplishes this more or
less.  You could set DAG.max_active_runs to 0 (assuming we even allow
that?) but set it to 10 for backfill and then only backfills would run.

But one small nuance, on a related topic: in the scheduler's "which dag runs
should i process next" queries, I currently sort backfill dag runs below
other runs, so "normal" dag runs have priority for being processed by the
scheduler.  But that's a different kind of control, and the point there is
to protect "prod" jobs from being starved out by backfill.  And, probably we
will need to tweak the logic over time based on performance in the wild.
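
A rough sketch of that ordering, to show the idea rather than the real
query.  The `Run` record, its `is_backfill` flag, and its `queued_at`
timestamp are hypothetical names, not taken from the Airflow code base, and
the tie-break on `queued_at` is my assumption for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass
class Run:
    # Hypothetical stand-in for a dag run row; not the Airflow model.
    run_id: str
    is_backfill: bool
    queued_at: datetime


def processing_order(runs: Iterable[Run]) -> List[Run]:
    """Order runs so non-backfill runs are always considered first.

    False sorts before True, so is_backfill=False rows lead; within each
    group, older runs come first (assumed tie-break).
    """
    return sorted(runs, key=lambda r: (r.is_backfill, r.queued_at))


if __name__ == "__main__":
    runs = [
        Run("backfill_1", True, datetime(2024, 10, 1)),
        Run("scheduled_1", False, datetime(2024, 10, 2)),
        Run("scheduled_2", False, datetime(2024, 10, 3)),
    ]
    for run in processing_order(runs):
        print(run.run_id)
```

Note this only affects which runs get looked at first; it does not cap
anything, which is why it is separate from max_active_runs.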

Ok, having responded to your thoughts, I will follow up with another
message trying to focus the discussion.


On Thu, Oct 3, 2024 at 8:57 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> I think it should be controllable at the moment when you start backfill.
>
> IMHO default behaviour should be that there should be very little
> concurrency reserved for backfills so that you could run backfills without
> impacting regular runs - say "max 10 backfill task instances" and "max 3
> backfill runs".
>
> But then - you should be able to address special case when you want to
> prioritise the backfills - and in certain cases even starve the regular
> runs because you REALLY need to backfill old data asap  - and there you
> should be able to override the max for specific backfill instance.
>
> J.
>
>
> On Thu, Oct 3, 2024 at 8:16 PM Daniel Standish
> <daniel.stand...@astronomer.io.invalid> wrote:
>
> > Just adding the [DISCUSS] prefix, which I forgot to add.
> >
> > On Thu, Oct 3, 2024 at 4:23 PM Daniel Standish <
> > daniel.stand...@astronomer.io> wrote:
> >
> > > Ok so, I'm thinking through what makes sense re concurrency control in
> > > backfill.
> > >
> > > It was referred to
> > > <
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=311627729#AIP78Schedulermanagedbackfill-Otherideasunderconsideration
> > >
> > > in the AIP but I didn't define the behavior:
> > >
> > > Other ideas under consideration
> > >>
> > >>    - Add extra concurrency control on dag run
> > >>
> > >>
> > >>    - Apply max active dag runs separately for backfill
> > >>
> > >>
> > >>    - Override any dag param in creating the backfill job and it’s only
> > >>    applied in that scope
> > >>
> > >>
> > >>
> > > As I have proceeded with implementation, here's what I went with:
> > >
> > > Each "backfill" gets its own concurrency control ("max_active_runs")
> that
> > > is evaluated completely separate from the DAG scope max_active_runs
> > >
> > > So if DAG max active runs is 2, and the backfill max active runs is 1,
> > > then you can have max of 3 concurrent runs.  Your non-backfill dags
> > cannot
> > > starve out the backfill ones, and backfill dag runs cannot starve out
> the
> > > non-backfill ones.
> > >
> > > The other way to go is to say that DAG.max_active_runs is global.  This
> > > does not feel quite right to me cus it gets a bit murky.  E.g. what
> > happens
> > > if DAG.max is 10 and Backfill.max is 10.  Do you allow it?  What do you
> > do
> > > to avoid starving out non-backfill runs?
> > >
> > > What do people think?  Relevant PR is here
> > > <https://github.com/apache/airflow/pull/42686>.
> > >
> >
>
