Re: [DISCUSS] AIP-100: Eliminate Scheduler Starvation

Vikram Koka via dev Sun, 17 May 2026 07:25:58 -0700

Thanks Jens and Jarek for the detailed analysis. The thread has helped
surface the right questions.






Before we discuss solutions, whether incremental SQL pre-filtering, scoring
formulas, or HITL priority overrides, I think we need to go back to basics:

*what specific starvation scenarios are we solving, and where is the
evidence that they exist on current main?*





The AIP asserts that scheduler starvation is a problem, but I haven't seen:

- A concrete workload definition where starvation occurs (which Dag
patterns, what concurrency settings, what pool configuration)

- A reproduction on current main that demonstrates tasks being starved,
with metrics showing how long and how severely

- An analysis of whether the existing mitigation mechanisms (the
itertools.count re-scan loop, the "starved" NOT IN filters, the
row_number(), windowing) are actually failing, and if so, why


Without this foundation, we're designing solutions to a problem that may
not exist in practice, or that may already be adequately handled by
the mechanisms
Jarek described.


Additionally, AIP-98 (async support for PythonOperator, shipped in 3.2.0)
already (at least partially) addresses two of the four workload patterns
Jens identified:

- Patterns (2) and (4), both based on Dynamic Task Mapping. Scenarios that
previously generated thousands of individually scheduled mapped task
instances

- These can now run as a single async task on a worker, removing that load
from the scheduler entirely.

- Any starvation analysis needs to account for mitigations that have
already shipped before concluding that new scheduler logic is needed.


My recommendation:

- Start with benchmarks.

- Use Jens's four Dag layout patterns as a framework, run them against
current main with realistic concurrency and pool settings, and show us
where starvation actually occurs.


Let's use the revealed data to have a much more focused discussion about
where and how to address it.


Best regards,

Vikram



On Fri, May 15, 2026 at 2:16 AM Jarek Potiuk <[email protected]> wrote:

> Hi Natanel,
>
> I had a long conversation with my agent, who looked at the proposal,
> Current code and earlier discussions....  I did review it and corrected it.
> A few things might not be entirely accurate, of course. However
> The final review report seems plausible. Proposal to take
> baby steps and actually **measure** where we are today and whether
> those baby steps will be enough to improve your experience seems
> to be very good. So ... let me hand it off to what  my agent drafted:
>
>
> Thanks for picking this up again -- the redrafted version is much more
> focused than the previous attempt and I want to engage with it
> constructively. A few code-grounded points below, several of which
> reinforce what Jens raised in his reply yesterday.
>
>
> ** On Jens's "simple approach" -- it already exists **
> Jens described what is essentially the outer loop currently in
> `_executable_task_instances_to_queued` on `main`. I think it's worth
> surfacing this in the AIP itself, because two of the building blocks
> the proposal recommends are already partially in production -- which
> changes the baseline the AIP has to beat.
>
> In the code today:
>
> - `for loop_count in itertools.count(start=1)` at
> airflow-core/src/airflow/jobs/scheduler_job_runner.py:542
> - four "starved" sets accumulating constraints discovered in the
> Python pass (:526-538, populated in the per-TI loop at :692-858)
> - fed back as NOT IN filters on the next iteration (:581-593)
> - exit at :870-880 when either a TI was queued, the SQL returned
> fewer rows than asked for (no more candidates), or no new filter
> was learned.
>
> Alongside that, a window function caps `max_active_tasks_per_dag` at
> SQL level (:598-612) -- `row_number() OVER (PARTITION BY dag_id, run_id
> ORDER BY -priority_weight, ...)` filtered by `row_num <=
> dr_max_active_tasks`. A 100k-mapped-task run with `max_active_tasks=8`
> contributes at most 8 rows to the batch, not 100k. So the "scheduler
> fixates on one constrained group" pathology the AIP attributes to the
> current design is already mitigated for that dimension.
>
> What is *not* yet pushed into SQL: pool slots, executor slots,
> `max_active_tis_per_dag`, `max_active_tis_per_dagrun`. The natural
> incremental next step is to extend the same windowing / pre-filter
> pattern to each of those -- which is exactly what PR #54103 does for
> the pool dimension. Each limit moved from the Python loop (:689-858)
> into SQL shrinks the batch the scheduler has to walk.
>
>
> ** Which means: benchmarks need to be against latest `main` **
> Vikram and Jens both flagged the missing "hard facts." I would frame it
> more specifically: the document is comparing against an old
> characterization of the scheduler. A workload like Jens's pattern (4)
> (medium-complex DAG with a 50-100k-task mapped task group) needs to be
> benchmarked against current `main` -- with `max_active_tasks` capping
> and the feedback loop active -- and ideally with the further
> incremental SQL-side filters from #54103-style follow-ups, before the
> case for scoring-formula redesign is convincing. The starvation gap is
> shrinking iteration by iteration; the AIP needs to show what's left.
>
>
> ** Starvation is a query-shape problem, not a ranking problem **
> If the loaded batch is 1024 TIs from a DAG already at some Python-side
> limit (executor slots, `max_active_tis_per_dagrun`, etc.), no scoring
> formula recovers throughput -- only SQL-level filtering does. Aging /
> HRRN / urgency only changes which TI wins a contested slot *once*
> candidates are loaded. So the recommended direction does not obviously
> dominate the dismissed SQL-shape alternatives -- PR #54103 (single-
> limit pre-filter, the natural next windowing step) and PR #55537 (1.7-
> 2x via SQL procedures) deserve to be the baseline the new design has to
> beat, not a footnote.
>
>
> ** The scoring formula's inputs need granularity clarification **
> `deadline` is actually first-class as of Airflow 3.0, via
> `DAG(deadline=DeadlineAlert(...))`
> (airflow-core/src/airflow/models/deadline.py, models/deadline_alert.py)
> with references `DAGRUN_LOGICAL_DATE`, `DAGRUN_QUEUED_AT`,
> `AVERAGE_RUNTIME(...)`, `FIXED_DATETIME(...)`. But the existing
> mechanism is *DagRun-level* and fires a *callback on miss* via the
> scheduler's per-loop check at scheduler_job_runner.py:1674-1693.
> TaskInstance has no deadline column. The AIP's per-task
> `Wu/(deadline-now-service)` term therefore needs either (a) a way to
> derive a per-TI deadline from a DagRun-level `DeadlineAlert` -- non-
> trivial for interval-based alerts, ambiguous when a DAG carries
> multiple alerts -- or (b) a new per-TI deadline column with its own
> user-facing API. `service_time` similarly overlaps with the existing
> `AVERAGE_RUNTIME` reference's averaging machinery, but again at DagRun
> granularity. Worth spelling out in the AIP which path is intended;
> conflating "deadline as miss-callback trigger" with "deadline as
> scheduling-urgency signal" is a substantive semantic shift.
>
> Separately: HRRN is a preemptive-scheduler heuristic. Airflow's
> scheduler is non-preemptive -- once a TI is queued, it runs. The queue-
> selection window is short relative to task runtimes, so the value of
> aging is mostly cosmetic.
>
> ** HA / critical-section concern **
>
> The critical section runs under `pg_try_advisory_xact_lock` (:499) and
> `with_row_locks(..., skip_locked=True)` (:642). Replacing
> `order_by(-TI.priority_weight, ...)` with a computed expression over
> multiple columns risks lengthening the held critical section and
> hurting HA throughput. Worth measuring before committing to in-SQL
> scoring.
>
>
> ** On terminology -- +1 to Jens **
> In Airflow's state model "scheduled" already means "predecessors done,
> eligible to be queued", and the function the AIP is rewriting is
> `_executable_task_instances_to_queued`. Renaming "scheduling" to
> "queueing" throughout the document would remove a lot of avoidable
> confusion -- I had to re-read several sections to keep the terms
> straight.
>
>
> ** HITL is separable **
> This is the load-bearing observation behind Jens's `UPDATE
> task_instance SET priority_weight = priority_weight + 100 WHERE ...`
> question on the wiki: the mechanism for HITL priority override
> basically exists today.
>
> - `priority_weight` is a real column (taskinstance.py:544).
> - The scheduler reads it directly when ranking
> (scheduler_job_runner.py:575 / :603).
> - So the SQL UPDATE on a scheduled TI does change which TI wins the
> next pool slot.
> - Caveat: `refresh_from_task()` (taskinstance.py:892) can recompute
> it on DAG re-parse / map expansion, so it's racy.
> - No REST/UI surface (`PatchTaskInstanceBody` only accepts
> `new_state` and `note`).
>
> So a supported HITL override is a two-file change -- a non-clobbered
> `priority_override` column + a REST endpoint +
> `order_by(-coalesce(TI.priority_override, TI.priority_weight), ...)`
> -- that does not need any of the multi-factor scoring machinery to
> exist. I would split it out of AIP-100 into its own small AIP, since
> it can ship today and the scoring redesign is contingent on the
> benchmark question above.
>
>
> ** Suggested split **
> In the spirit of Jens's "requirements are entry conditions, not
> results" framing -- the AIP currently bundles three different products
> with different blast radii:
>
> (a) Extending the existing windowing / feedback-loop pattern to the
> remaining limits (pool slots, executor slots,
> `max_active_tis_per_dag`, `max_active_tis_per_dagrun`).
> Incremental, low risk, builds on what works, addresses the
> actual starvation cases. PR #54103 is the template.
>
> (b) HITL priority override -- separable as above.
>
> (c) Multi-factor scoring (HRRN + deadline + service_time) --
> speculative until (a) and (b) demonstrate it's still necessary,
> and contingent on resolving the granularity question for
> `deadline` and `service_time`.
>
> Approving the bundle smuggles a verdict on each. Splitting lets (a)
> and (b) ship while (c) gets the benchmark and granularity discussion
> it needs.
>
> * On the 1-2h call Jens proposed *
>
> I think this would be productive -- if you and Jens want to organise
> it, I would try to make it. Anchoring the agenda on "what's the gap
> between current `main` (with windowing + feedback loop + PR
> #54103-style incremental extensions) and pattern (4) workloads?"
> would resolve a lot of the open questions.
>
> Best,
> Jarek
>
> ---
> Drafted-by: Claude Code (Opus 4.7); reviewed by Jarek before sending
>
> On Thu, May 14, 2026 at 11:34 PM Jens Scheffler <[email protected]>
> wrote:
>
> > Hi,
> >
> > sorry after some time re-visited the document and must say... is a long
> > read in a large complexity. Needed to read parts multiple times. Maybe
> > it is excessively hard because over time it fragmented the chapters and
> > approaches a bit. Maybe also due to this I am not really sold and
> > similar like Vikram missing some "hard facts" and requirements. Yes I
> > know we have limits and also in our environments we see sometimes
> > starvations... but I am not sure if the proposed solution will fix it.
> > Sometimes I feel like starvation is also on DB level or other limits? I
> > am not sure.
> >
> > When I take a look to our production then where scheduler today also
> > "starves" and is locked for a long time are a lot of other tasks  which
> > are unrelated to queuing, like finding orphaned tasks, resetting them
> > because of missing heartbeats or so. Applying Dag changes becuase of Dag
> > version changes. In our production I mostly feel like if the scheduler
> > would focus on scheduling tasks then it would get things done. I
> > somethings thought making two instances "just" for scheduling Dag runs,
> > two only for calculating if tasks are ready to get to scheduled state
> > and keep some instances just for the queue processing. Would be even
> > great to see where most time is spend. Also thought (if I would have
> > time) adding more metrics to see more details where time is wasted.
> >
> > Anyway getting back to proposal: I am not sure. I think the write-up
> > needs a bit of re-structuring.
> >
> > Reason: Assuming we can not have a "quick fix" with the current logic
> > and before nailing down a new target logic I think we need to re-assess
> > the requirements first. This is not really elaborated but some chapters
> > make implict assumptions. E.g. we drop task level priorities. Can we?
> > Such things are critical entry conditions to plan a target logic and
> > optimize it.
> >
> > Similar with the HITL optimizing individial exceptions. This would be an
> > new, cool additional requirement as game-changer which potentially
> > influences the solution space.
> >
> > When updating the document I'd also propose to change all terms from
> > "scheduling" in the document to "queueing". Took me another time a lot
> > of confusion because in the state model of Airflow the state "scheduled"
> > is used to mark that all predecessors are in the right state that a task
> > can be queued. For me this is "scheduling". This is also a complex
> > calculation but explicit is not scope of the paper. It is only talking
> > about the step where tasks are put to "queued" state for Executor to
> > pick-up. So "making tasks running/executing".
> >
> > Then I also have a bit of a problem with the elaboration about defining
> > priority on Dag level. Whereas I assume that most use cases are OK to
> > prioritize Dags and not individual tasks (but need to check if this is a
> > requirement or not!) I am not sure how this would make the logic of
> > scheduling easier. At the end entities of tasks need to be scheduled.
> > Not Dags. If there is an algorithm that is a candidate then some details
> > are missing to understand if this is an alternative. Otherwise the
> > option does not be really considered.
> >
> > Also as I commented earlier I would like to understand which Dag and
> > task volumes are considered. We know about the todays standard limit of
> > 1024 in Dynamic Mapped tasks. With the change applied can this then be
> > safely increased to 10k? 100k? How many "tasks per hour" do we assume to
> > get scheduled? Will there be a metric of "tasks dropped out of
> > scheduling" because of limits that we can see today and compare to
> > future logic? So which factor is assumed to improve? How much CPU is
> > needed for the N tasks per hour? Will logic scale linear that if I
> > increase to N scheduler instances will it run N-times speed? Or only
> > log(N)?
> >
> > I am not sure about other environments and if the logic helps in all use
> > cases but I assume from our site I think we have 4 major different Dag
> > layout patterns. Will the logic still be OK and improve all of them or
> > do we assume a drawback for the benefit in scaling on the other side?
> >
> >  1. Simple Dag with 1-3 tasks, but thousands of runs scheduled per hour,
> >     ~1000 runs concurrently (no problem today actually, see no
> starvation)
> >  2. Simple Dag but Mapped Task with today 1024 -> Target 100k tasks in a
> >     run. ~10 runs concurrently, maybe more queued (noproblem today up to
> >     <~2000 mapped tasks)
> >  3. Complex Dag with 10-500 tasks in a complex graph of dependency. Not
> >     a large volume of mapped tasks. Many parallel runs (our Dags are not
> >     too complex, other feedback welcome)
> >  4. Medium complex Dag, 10-50 tasks in a Mapped Task Group. Mapping
> >     desired to be 1000+ so in total 50-100k tasks in a single run. Maybe
> >     10 runs (this is a severe problem when mapping is >100 already!)
> >
> > In case (2) and (4) we also saw failures in OOM as the scheduler
> > attempted to load all tasks in memory. Not sure if this was a problem of
> > queuing or maybe even before in getting to scheduled. Would therefore be
> > good to clarify which cases are potential to improve.
> >
> > Some final comment: Have you also considered a simple approach that in
> > the described "starvation" when the dropout rate is high (e.g. 90%) use
> > the dropped Dags and Tasks / Pools that lead to dropout as additional
> > select filter and query another round of tasks via `max_tis_per_query`
> > and repeat until the dropout is below a threshold? That might extend the
> > time for a scheduler loop but due to applied incremental filter improve
> > effectiveness of tasks that are scheduled. As well as maximizes the net
> > time scheduler spends on task queing compared to other things where net
> > scheduler time is spend on if queuing is the limiting factor.
> >
> > While my write-up should not be treated as a rejection I think the
> > AIP-100 needs to be a bit re-sorted to be understandable and convincing.
> > So far I am not clear if the proposed solution is fixing the root.
> >
> > Not sure if there are other experts with other (general) opinions but
> > not having this paper starving in review I could tink of having a 1-2h
> > call to discuss in-person to sort things. I could offer to join to give
> > some guidance if not others are on this already.
> >
> > Jens
> >
> > On 14.05.26 18:28, Jarek Potiuk wrote:
> > > Hi Vikram,
> > >
> > > Natanel has reworked the approach, and it is receiving a much warmer
> > > reception than before. This is Natanel's restart attempt, and there has
> > > been no other discussion about it. While the "why" remains the
> > > same—addressing starvation issues experienced by various users
> (including
> > > Jens' team)—the previous proposals were rejected by Ash, Jens, and me
> due
> > > to excessive complexity, such as building new algorithms via stored
> > > procedures and such.
> > >
> > > The current proposal is simpler and introduces an interesting way to
> > > prioritize SLA callbacks - whichis an interesting feature to have.
> While
> > it
> > > requires further scrutiny, it appears to be a direction we could
> > > potentially accept.
> > >
> > > Best,
> > > Jarek
> > >
> > > On Thu, May 14, 2026 at 4:50 PM Vikram Koka via dev<
> > [email protected]>
> > > wrote:
> > >
> > >> I probably missed the updates here, but when this was brought up at
> the
> > dev
> > >> call a while ago, I thought the response was a firm "No".
> > >> Did I miss the follow-on updates?
> > >>
> > >> The discussion centered on the "why".
> > >> This is such a "power user" feature, which is fine, but it lacks
> > >> articulation of the projected benefit in quantifiable terms.
> > >>
> > >> It has very high technical complexity in the core of Airflow, which
> > could
> > >> cause massive ripple effects and therefore requires a cautious
> approach.
> > >>
> > >> I left a comment in the AIP document as well, about needing to
> > understand
> > >> the "why" better.
> > >>
> > >> Vikram
> > >>
> > >>
> > >>
> > >> On Thu, May 14, 2026 at 1:28 AM Christos Bisias<[email protected]
> >
> > >> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> I've got a question but I can't find a way to add a comment under the
> > >> AIP.
> > >>> The 'Weighted Aging and SLA Urgency' solves the starvation issue but
> do
> > >> we
> > >>> have an idea of what happens with the scheduler's performance?
> > >>>
> > >>> Christos
> > >>>
> > >>> On Thu, May 14, 2026 at 7:47 AM Elad Kalif<[email protected]>
> wrote:
> > >>>
> > >>>> Kind reminder to everyone who still wants to add comments
> > >>>> If no further comments i think we can move this to a vote.
> > >>>>
> > >>>> On Tue, May 5, 2026 at 12:00 AM Jarek Potiuk<[email protected]>
> wrote:
> > >>>>
> > >>>>> Very interesting - also will take a look shortly - and great how it
> > >>> seems
> > >>>>> to tap into SLA urgency indeed..
> > >>>>>
> > >>>>> On Mon, May 4, 2026 at 10:08 PM Przemysław Mirowski <
> > >> [email protected]
> > >>>>> wrote:
> > >>>>>
> > >>>>>> +1 for Weighted Aging and SLA Urgency proposition.
> > >>>>>> ________________________________
> > >>>>>> From: Elad Kalif<[email protected]>
> > >>>>>> Sent: 30 April 2026 12:08
> > >>>>>> To:[email protected] <[email protected]>
> > >>>>>> Subject: Re: [DISCUSS] AIP-100: Eliminate Scheduler Starvation
> > >>>>>>
> > >>>>>> I love the idea of dynamic priorities and I think this is a good
> > >>>>> direction.
> > >>>>>> On Thu, Apr 30, 2026 at 12:58 PM Ash Berlin-Taylor <
> [email protected]
> > >>>>> wrote:
> > >>>>>>> Thanks, this re-drafted version looks interesting. I’m trying to
> > >>>>>>> internalise and understand the new proposal. I’l leave a few
> > >>>>>>> comments/questions on the confluence page as I go.
> > >>>>>>>
> > >>>>>>> (I have to say, the move away from epic stored procedure def is a
> > >>>>> welcome
> > >>>>>>> change!)
> > >>>>>>>
> > >>>>>>> -ash
> > >>>>>>>
> > >>>>>>>> On 22 Apr 2026, at 14:45, Natanel<[email protected]>
> > >>> wrote:
> > >>>>>>>> Hello community, I want to spark again a discussion that was
> > >> held
> > >>>> in
> > >>>>>> the
> > >>>>>>>> past and delayed (due to the focus on releasing airflow 3.2),
> > >> now
> > >>>>> that
> > >>>>>>> 3.2
> > >>>>>>>> is released, I think it might be a good Idea to bring up the
> > >>>>> discussion
> > >>>>>>>> again.
> > >>>>>>>>
> > >>>>>>>> Wiki:
> > >>>>>>>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-100+Eliminate+Scheduler+Starvation+On+Concurrency+Limits
> > >>>>>>>> In the current situation, tasks may starve in airflow, and in
> > >>> large
> > >>>>>> scale
> > >>>>>>>> deployments (hundreds of thousands of tasks (or more) per day)
> > >> we
> > >>>>> tend
> > >>>>>> to
> > >>>>>>>> experience severe starvation, where a group of tasks may starve
> > >>>> other
> > >>>>>>>> tasks, not allowing them to run, as described in the wiki.
> > >>>>>>>>
> > >>>>>>>> After the february devcall, where I have proposed the AIP, a
> > >> few
> > >>>>>> comments
> > >>>>>>>> have arised, and so I had begun to research again about
> > >> different
> > >>>>>>>> scheduling algorithms and I have added to the considerations as
> > >>>> part
> > >>>>> of
> > >>>>>>> the
> > >>>>>>>> AIP.
> > >>>>>>>>
> > >>>>>>>> As of now, the state of the AIP is where there are a few ideas
> > >>>>> proposed
> > >>>>>>>> (some of which are pretty similar to each other, while others
> > >> are
> > >>>>> quite
> > >>>>>>>> different), as the main concern from the devcall was that the
> > >>>>>> approaches
> > >>>>>>>> given might not be the best way to solve the issue, as it is a
> > >>> very
> > >>>>>> hard
> > >>>>>>>> problem to solve.
> > >>>>>>>>
> > >>>>>>>> After that, I have made some edits to the AIP and to the
> > >>>>> propositions,
> > >>>>>> in
> > >>>>>>>> order to help decide and clarify the advantages and
> > >> disadvantages
> > >>>> of
> > >>>>>> each
> > >>>>>>>> approach.
> > >>>>>>>> The current "best approach" can be found here here
> > >>>>>>>> <
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618462#AIP100EliminateSchedulerStarvationOnConcurrencyLimits-Currentbestproposition
> > >>>>>>>> ,
> > >>>>>>>> where the new proposed algorithms are defined here
> > >>>>>>>> <
> > >>
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618462#AIP100EliminateSchedulerStarvationOnConcurrencyLimits-Othernonagingalgorithmsother_algs
> > >>>>>>>> .
> > >>>>>>>>
> > >>>>>>>> In order to continue with the effort, a community consensus
> > >> needs
> > >>>> to
> > >>>>> be
> > >>>>>>>> reached about the preferred solution/solutions, once this is
> > >>> done,
> > >>>> it
> > >>>>>> is
> > >>>>>>>> possible to go on and implement + stress test the proposed
> > >>>>> solution/s.
> > >>>>>>>> I would appreciate a review from community members, moreover, I
> > >>>> would
> > >>>>>>> also
> > >>>>>>>> appreciate any new propositions or improvements which can be
> > >>> done.
> > >>>>>>>> Best regards
> > >>>>>>>> Natanel.
> > >>>>>>>
> > >>>>>>>
> > >>> ---------------------------------------------------------------------
> > >>>>>>> To unsubscribe, e-mail:[email protected]
> > >>>>>>> For additional commands, e-mail:[email protected]
> > >>>>>>>
> > >>>>>>>
>

Re: [DISCUSS] AIP-100: Eliminate Scheduler Starvation

Reply via email to