Re: [DISCUSS] AIP-100: Eliminate Scheduler Starvation

Jarek Potiuk Thu, 14 May 2026 17:15:57 -0700

Hi Natanel,

I had a long conversation with my agent, who looked at the proposal,
Current code and earlier discussions....  I did review it and corrected it.
A few things might not be entirely accurate, of course. However
The final review report seems plausible. Proposal to take
baby steps and actually **measure** where we are today and whether
those baby steps will be enough to improve your experience seems
to be very good. So ... let me hand it off to what  my agent drafted:



Thanks for picking this up again -- the redrafted version is much more
focused than the previous attempt and I want to engage with it
constructively. A few code-grounded points below, several of which
reinforce what Jens raised in his reply yesterday.


** On Jens's "simple approach" -- it already exists **
Jens described what is essentially the outer loop currently in
`_executable_task_instances_to_queued` on `main`. I think it's worth
surfacing this in the AIP itself, because two of the building blocks
the proposal recommends are already partially in production -- which
changes the baseline the AIP has to beat.

In the code today:

- `for loop_count in itertools.count(start=1)` at
airflow-core/src/airflow/jobs/scheduler_job_runner.py:542
- four "starved" sets accumulating constraints discovered in the
Python pass (:526-538, populated in the per-TI loop at :692-858)
- fed back as NOT IN filters on the next iteration (:581-593)
- exit at :870-880 when either a TI was queued, the SQL returned
fewer rows than asked for (no more candidates), or no new filter
was learned.

Alongside that, a window function caps `max_active_tasks_per_dag` at
SQL level (:598-612) -- `row_number() OVER (PARTITION BY dag_id, run_id
ORDER BY -priority_weight, ...)` filtered by `row_num <=
dr_max_active_tasks`. A 100k-mapped-task run with `max_active_tasks=8`
contributes at most 8 rows to the batch, not 100k. So the "scheduler
fixates on one constrained group" pathology the AIP attributes to the
current design is already mitigated for that dimension.

What is *not* yet pushed into SQL: pool slots, executor slots,
`max_active_tis_per_dag`, `max_active_tis_per_dagrun`. The natural
incremental next step is to extend the same windowing / pre-filter
pattern to each of those -- which is exactly what PR #54103 does for
the pool dimension. Each limit moved from the Python loop (:689-858)
into SQL shrinks the batch the scheduler has to walk.


** Which means: benchmarks need to be against latest `main` **
Vikram and Jens both flagged the missing "hard facts." I would frame it
more specifically: the document is comparing against an old
characterization of the scheduler. A workload like Jens's pattern (4)
(medium-complex DAG with a 50-100k-task mapped task group) needs to be
benchmarked against current `main` -- with `max_active_tasks` capping
and the feedback loop active -- and ideally with the further
incremental SQL-side filters from #54103-style follow-ups, before the
case for scoring-formula redesign is convincing. The starvation gap is
shrinking iteration by iteration; the AIP needs to show what's left.


** Starvation is a query-shape problem, not a ranking problem **
If the loaded batch is 1024 TIs from a DAG already at some Python-side
limit (executor slots, `max_active_tis_per_dagrun`, etc.), no scoring
formula recovers throughput -- only SQL-level filtering does. Aging /
HRRN / urgency only changes which TI wins a contested slot *once*
candidates are loaded. So the recommended direction does not obviously
dominate the dismissed SQL-shape alternatives -- PR #54103 (single-
limit pre-filter, the natural next windowing step) and PR #55537 (1.7-
2x via SQL procedures) deserve to be the baseline the new design has to
beat, not a footnote.


** The scoring formula's inputs need granularity clarification **
`deadline` is actually first-class as of Airflow 3.0, via
`DAG(deadline=DeadlineAlert(...))`
(airflow-core/src/airflow/models/deadline.py, models/deadline_alert.py)
with references `DAGRUN_LOGICAL_DATE`, `DAGRUN_QUEUED_AT`,
`AVERAGE_RUNTIME(...)`, `FIXED_DATETIME(...)`. But the existing
mechanism is *DagRun-level* and fires a *callback on miss* via the
scheduler's per-loop check at scheduler_job_runner.py:1674-1693.
TaskInstance has no deadline column. The AIP's per-task
`Wu/(deadline-now-service)` term therefore needs either (a) a way to
derive a per-TI deadline from a DagRun-level `DeadlineAlert` -- non-
trivial for interval-based alerts, ambiguous when a DAG carries
multiple alerts -- or (b) a new per-TI deadline column with its own
user-facing API. `service_time` similarly overlaps with the existing
`AVERAGE_RUNTIME` reference's averaging machinery, but again at DagRun
granularity. Worth spelling out in the AIP which path is intended;
conflating "deadline as miss-callback trigger" with "deadline as
scheduling-urgency signal" is a substantive semantic shift.

Separately: HRRN is a preemptive-scheduler heuristic. Airflow's
scheduler is non-preemptive -- once a TI is queued, it runs. The queue-
selection window is short relative to task runtimes, so the value of
aging is mostly cosmetic.

** HA / critical-section concern **

The critical section runs under `pg_try_advisory_xact_lock` (:499) and
`with_row_locks(..., skip_locked=True)` (:642). Replacing
`order_by(-TI.priority_weight, ...)` with a computed expression over
multiple columns risks lengthening the held critical section and
hurting HA throughput. Worth measuring before committing to in-SQL
scoring.


** On terminology -- +1 to Jens **
In Airflow's state model "scheduled" already means "predecessors done,
eligible to be queued", and the function the AIP is rewriting is
`_executable_task_instances_to_queued`. Renaming "scheduling" to
"queueing" throughout the document would remove a lot of avoidable
confusion -- I had to re-read several sections to keep the terms
straight.


** HITL is separable **
This is the load-bearing observation behind Jens's `UPDATE
task_instance SET priority_weight = priority_weight + 100 WHERE ...`
question on the wiki: the mechanism for HITL priority override
basically exists today.

- `priority_weight` is a real column (taskinstance.py:544).
- The scheduler reads it directly when ranking
(scheduler_job_runner.py:575 / :603).
- So the SQL UPDATE on a scheduled TI does change which TI wins the
next pool slot.
- Caveat: `refresh_from_task()` (taskinstance.py:892) can recompute
it on DAG re-parse / map expansion, so it's racy.
- No REST/UI surface (`PatchTaskInstanceBody` only accepts
`new_state` and `note`).

So a supported HITL override is a two-file change -- a non-clobbered
`priority_override` column + a REST endpoint +
`order_by(-coalesce(TI.priority_override, TI.priority_weight), ...)`
-- that does not need any of the multi-factor scoring machinery to
exist. I would split it out of AIP-100 into its own small AIP, since
it can ship today and the scoring redesign is contingent on the
benchmark question above.


** Suggested split **
In the spirit of Jens's "requirements are entry conditions, not
results" framing -- the AIP currently bundles three different products
with different blast radii:

(a) Extending the existing windowing / feedback-loop pattern to the
remaining limits (pool slots, executor slots,
`max_active_tis_per_dag`, `max_active_tis_per_dagrun`).
Incremental, low risk, builds on what works, addresses the
actual starvation cases. PR #54103 is the template.

(b) HITL priority override -- separable as above.

(c) Multi-factor scoring (HRRN + deadline + service_time) --
speculative until (a) and (b) demonstrate it's still necessary,
and contingent on resolving the granularity question for
`deadline` and `service_time`.

Approving the bundle smuggles a verdict on each. Splitting lets (a)
and (b) ship while (c) gets the benchmark and granularity discussion
it needs.

* On the 1-2h call Jens proposed *

I think this would be productive -- if you and Jens want to organise
it, I would try to make it. Anchoring the agenda on "what's the gap
between current `main` (with windowing + feedback loop + PR
#54103-style incremental extensions) and pattern (4) workloads?"
would resolve a lot of the open questions.

Best,
Jarek

---
Drafted-by: Claude Code (Opus 4.7); reviewed by Jarek before sending

On Thu, May 14, 2026 at 11:34 PM Jens Scheffler <[email protected]> wrote:

> Hi,
>
> sorry after some time re-visited the document and must say... is a long
> read in a large complexity. Needed to read parts multiple times. Maybe
> it is excessively hard because over time it fragmented the chapters and
> approaches a bit. Maybe also due to this I am not really sold and
> similar like Vikram missing some "hard facts" and requirements. Yes I
> know we have limits and also in our environments we see sometimes
> starvations... but I am not sure if the proposed solution will fix it.
> Sometimes I feel like starvation is also on DB level or other limits? I
> am not sure.
>
> When I take a look to our production then where scheduler today also
> "starves" and is locked for a long time are a lot of other tasks  which
> are unrelated to queuing, like finding orphaned tasks, resetting them
> because of missing heartbeats or so. Applying Dag changes becuase of Dag
> version changes. In our production I mostly feel like if the scheduler
> would focus on scheduling tasks then it would get things done. I
> somethings thought making two instances "just" for scheduling Dag runs,
> two only for calculating if tasks are ready to get to scheduled state
> and keep some instances just for the queue processing. Would be even
> great to see where most time is spend. Also thought (if I would have
> time) adding more metrics to see more details where time is wasted.
>
> Anyway getting back to proposal: I am not sure. I think the write-up
> needs a bit of re-structuring.
>
> Reason: Assuming we can not have a "quick fix" with the current logic
> and before nailing down a new target logic I think we need to re-assess
> the requirements first. This is not really elaborated but some chapters
> make implict assumptions. E.g. we drop task level priorities. Can we?
> Such things are critical entry conditions to plan a target logic and
> optimize it.
>
> Similar with the HITL optimizing individial exceptions. This would be an
> new, cool additional requirement as game-changer which potentially
> influences the solution space.
>
> When updating the document I'd also propose to change all terms from
> "scheduling" in the document to "queueing". Took me another time a lot
> of confusion because in the state model of Airflow the state "scheduled"
> is used to mark that all predecessors are in the right state that a task
> can be queued. For me this is "scheduling". This is also a complex
> calculation but explicit is not scope of the paper. It is only talking
> about the step where tasks are put to "queued" state for Executor to
> pick-up. So "making tasks running/executing".
>
> Then I also have a bit of a problem with the elaboration about defining
> priority on Dag level. Whereas I assume that most use cases are OK to
> prioritize Dags and not individual tasks (but need to check if this is a
> requirement or not!) I am not sure how this would make the logic of
> scheduling easier. At the end entities of tasks need to be scheduled.
> Not Dags. If there is an algorithm that is a candidate then some details
> are missing to understand if this is an alternative. Otherwise the
> option does not be really considered.
>
> Also as I commented earlier I would like to understand which Dag and
> task volumes are considered. We know about the todays standard limit of
> 1024 in Dynamic Mapped tasks. With the change applied can this then be
> safely increased to 10k? 100k? How many "tasks per hour" do we assume to
> get scheduled? Will there be a metric of "tasks dropped out of
> scheduling" because of limits that we can see today and compare to
> future logic? So which factor is assumed to improve? How much CPU is
> needed for the N tasks per hour? Will logic scale linear that if I
> increase to N scheduler instances will it run N-times speed? Or only
> log(N)?
>
> I am not sure about other environments and if the logic helps in all use
> cases but I assume from our site I think we have 4 major different Dag
> layout patterns. Will the logic still be OK and improve all of them or
> do we assume a drawback for the benefit in scaling on the other side?
>
>  1. Simple Dag with 1-3 tasks, but thousands of runs scheduled per hour,
>     ~1000 runs concurrently (no problem today actually, see no starvation)
>  2. Simple Dag but Mapped Task with today 1024 -> Target 100k tasks in a
>     run. ~10 runs concurrently, maybe more queued (noproblem today up to
>     <~2000 mapped tasks)
>  3. Complex Dag with 10-500 tasks in a complex graph of dependency. Not
>     a large volume of mapped tasks. Many parallel runs (our Dags are not
>     too complex, other feedback welcome)
>  4. Medium complex Dag, 10-50 tasks in a Mapped Task Group. Mapping
>     desired to be 1000+ so in total 50-100k tasks in a single run. Maybe
>     10 runs (this is a severe problem when mapping is >100 already!)
>
> In case (2) and (4) we also saw failures in OOM as the scheduler
> attempted to load all tasks in memory. Not sure if this was a problem of
> queuing or maybe even before in getting to scheduled. Would therefore be
> good to clarify which cases are potential to improve.
>
> Some final comment: Have you also considered a simple approach that in
> the described "starvation" when the dropout rate is high (e.g. 90%) use
> the dropped Dags and Tasks / Pools that lead to dropout as additional
> select filter and query another round of tasks via `max_tis_per_query`
> and repeat until the dropout is below a threshold? That might extend the
> time for a scheduler loop but due to applied incremental filter improve
> effectiveness of tasks that are scheduled. As well as maximizes the net
> time scheduler spends on task queing compared to other things where net
> scheduler time is spend on if queuing is the limiting factor.
>
> While my write-up should not be treated as a rejection I think the
> AIP-100 needs to be a bit re-sorted to be understandable and convincing.
> So far I am not clear if the proposed solution is fixing the root.
>
> Not sure if there are other experts with other (general) opinions but
> not having this paper starving in review I could tink of having a 1-2h
> call to discuss in-person to sort things. I could offer to join to give
> some guidance if not others are on this already.
>
> Jens
>
> On 14.05.26 18:28, Jarek Potiuk wrote:
> > Hi Vikram,
> >
> > Natanel has reworked the approach, and it is receiving a much warmer
> > reception than before. This is Natanel's restart attempt, and there has
> > been no other discussion about it. While the "why" remains the
> > same—addressing starvation issues experienced by various users (including
> > Jens' team)—the previous proposals were rejected by Ash, Jens, and me due
> > to excessive complexity, such as building new algorithms via stored
> > procedures and such.
> >
> > The current proposal is simpler and introduces an interesting way to
> > prioritize SLA callbacks - whichis an interesting feature to have. While
> it
> > requires further scrutiny, it appears to be a direction we could
> > potentially accept.
> >
> > Best,
> > Jarek
> >
> > On Thu, May 14, 2026 at 4:50 PM Vikram Koka via dev<
> [email protected]>
> > wrote:
> >
> >> I probably missed the updates here, but when this was brought up at the
> dev
> >> call a while ago, I thought the response was a firm "No".
> >> Did I miss the follow-on updates?
> >>
> >> The discussion centered on the "why".
> >> This is such a "power user" feature, which is fine, but it lacks
> >> articulation of the projected benefit in quantifiable terms.
> >>
> >> It has very high technical complexity in the core of Airflow, which
> could
> >> cause massive ripple effects and therefore requires a cautious approach.
> >>
> >> I left a comment in the AIP document as well, about needing to
> understand
> >> the "why" better.
> >>
> >> Vikram
> >>
> >>
> >>
> >> On Thu, May 14, 2026 at 1:28 AM Christos Bisias<[email protected]>
> >> wrote:
> >>
> >>> Hello,
> >>>
> >>> I've got a question but I can't find a way to add a comment under the
> >> AIP.
> >>> The 'Weighted Aging and SLA Urgency' solves the starvation issue but do
> >> we
> >>> have an idea of what happens with the scheduler's performance?
> >>>
> >>> Christos
> >>>
> >>> On Thu, May 14, 2026 at 7:47 AM Elad Kalif<[email protected]> wrote:
> >>>
> >>>> Kind reminder to everyone who still wants to add comments
> >>>> If no further comments i think we can move this to a vote.
> >>>>
> >>>> On Tue, May 5, 2026 at 12:00 AM Jarek Potiuk<[email protected]> wrote:
> >>>>
> >>>>> Very interesting - also will take a look shortly - and great how it
> >>> seems
> >>>>> to tap into SLA urgency indeed..
> >>>>>
> >>>>> On Mon, May 4, 2026 at 10:08 PM Przemysław Mirowski <
> >> [email protected]
> >>>>> wrote:
> >>>>>
> >>>>>> +1 for Weighted Aging and SLA Urgency proposition.
> >>>>>> ________________________________
> >>>>>> From: Elad Kalif<[email protected]>
> >>>>>> Sent: 30 April 2026 12:08
> >>>>>> To:[email protected] <[email protected]>
> >>>>>> Subject: Re: [DISCUSS] AIP-100: Eliminate Scheduler Starvation
> >>>>>>
> >>>>>> I love the idea of dynamic priorities and I think this is a good
> >>>>> direction.
> >>>>>> On Thu, Apr 30, 2026 at 12:58 PM Ash Berlin-Taylor <[email protected]
> >>>>> wrote:
> >>>>>>> Thanks, this re-drafted version looks interesting. I’m trying to
> >>>>>>> internalise and understand the new proposal. I’l leave a few
> >>>>>>> comments/questions on the confluence page as I go.
> >>>>>>>
> >>>>>>> (I have to say, the move away from epic stored procedure def is a
> >>>>> welcome
> >>>>>>> change!)
> >>>>>>>
> >>>>>>> -ash
> >>>>>>>
> >>>>>>>> On 22 Apr 2026, at 14:45, Natanel<[email protected]>
> >>> wrote:
> >>>>>>>> Hello community, I want to spark again a discussion that was
> >> held
> >>>> in
> >>>>>> the
> >>>>>>>> past and delayed (due to the focus on releasing airflow 3.2),
> >> now
> >>>>> that
> >>>>>>> 3.2
> >>>>>>>> is released, I think it might be a good Idea to bring up the
> >>>>> discussion
> >>>>>>>> again.
> >>>>>>>>
> >>>>>>>> Wiki:
> >>>>>>>>
> >>
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-100+Eliminate+Scheduler+Starvation+On+Concurrency+Limits
> >>>>>>>> In the current situation, tasks may starve in airflow, and in
> >>> large
> >>>>>> scale
> >>>>>>>> deployments (hundreds of thousands of tasks (or more) per day)
> >> we
> >>>>> tend
> >>>>>> to
> >>>>>>>> experience severe starvation, where a group of tasks may starve
> >>>> other
> >>>>>>>> tasks, not allowing them to run, as described in the wiki.
> >>>>>>>>
> >>>>>>>> After the february devcall, where I have proposed the AIP, a
> >> few
> >>>>>> comments
> >>>>>>>> have arised, and so I had begun to research again about
> >> different
> >>>>>>>> scheduling algorithms and I have added to the considerations as
> >>>> part
> >>>>> of
> >>>>>>> the
> >>>>>>>> AIP.
> >>>>>>>>
> >>>>>>>> As of now, the state of the AIP is where there are a few ideas
> >>>>> proposed
> >>>>>>>> (some of which are pretty similar to each other, while others
> >> are
> >>>>> quite
> >>>>>>>> different), as the main concern from the devcall was that the
> >>>>>> approaches
> >>>>>>>> given might not be the best way to solve the issue, as it is a
> >>> very
> >>>>>> hard
> >>>>>>>> problem to solve.
> >>>>>>>>
> >>>>>>>> After that, I have made some edits to the AIP and to the
> >>>>> propositions,
> >>>>>> in
> >>>>>>>> order to help decide and clarify the advantages and
> >> disadvantages
> >>>> of
> >>>>>> each
> >>>>>>>> approach.
> >>>>>>>> The current "best approach" can be found here here
> >>>>>>>> <
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618462#AIP100EliminateSchedulerStarvationOnConcurrencyLimits-Currentbestproposition
> >>>>>>>> ,
> >>>>>>>> where the new proposed algorithms are defined here
> >>>>>>>> <
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618462#AIP100EliminateSchedulerStarvationOnConcurrencyLimits-Othernonagingalgorithmsother_algs
> >>>>>>>> .
> >>>>>>>>
> >>>>>>>> In order to continue with the effort, a community consensus
> >> needs
> >>>> to
> >>>>> be
> >>>>>>>> reached about the preferred solution/solutions, once this is
> >>> done,
> >>>> it
> >>>>>> is
> >>>>>>>> possible to go on and implement + stress test the proposed
> >>>>> solution/s.
> >>>>>>>> I would appreciate a review from community members, moreover, I
> >>>> would
> >>>>>>> also
> >>>>>>>> appreciate any new propositions or improvements which can be
> >>> done.
> >>>>>>>> Best regards
> >>>>>>>> Natanel.
> >>>>>>>
> >>>>>>>
> >>> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail:[email protected]
> >>>>>>> For additional commands, e-mail:[email protected]
> >>>>>>>
> >>>>>>>

Re: [DISCUSS] AIP-100: Eliminate Scheduler Starvation

Reply via email to