Thanks Jens and Jarek for the detailed analysis. The thread has helped surface the right questions.
Before we discuss solutions, whether incremental SQL pre-filtering, scoring formulas, or HITL priority overrides, I think we need to go back to basics: *what specific starvation scenarios are we solving, and where is the evidence that they exist on current main?* The AIP asserts that scheduler starvation is a problem, but I haven't seen: - A concrete workload definition where starvation occurs (which Dag patterns, what concurrency settings, what pool configuration) - A reproduction on current main that demonstrates tasks being starved, with metrics showing how long and how severely - An analysis of whether the existing mitigation mechanisms (the itertools.count re-scan loop, the "starved" NOT IN filters, the row_number(), windowing) are actually failing, and if so, why Without this foundation, we're designing solutions to a problem that may not exist in practice, or that may already be adequately handled by the mechanisms Jarek described. Additionally, AIP-98 (async support for PythonOperator, shipped in 3.2.0) already (at least partially) addresses two of the four workload patterns Jens identified: - Patterns (2) and (4), both based on Dynamic Task Mapping. Scenarios that previously generated thousands of individually scheduled mapped task instances - These can now run as a single async task on a worker, removing that load from the scheduler entirely. - Any starvation analysis needs to account for mitigations that have already shipped before concluding that new scheduler logic is needed. My recommendation: - Start with benchmarks. - Use Jens's four Dag layout patterns as a framework, run them against current main with realistic concurrency and pool settings, and show us where starvation actually occurs. Let's use the revealed data to have a much more focused discussion about where and how to address it. Best regards, Vikram On Fri, May 15, 2026 at 2:16 AM Jarek Potiuk <[email protected]> wrote: > Hi Natanel, > > I had a long conversation with my agent, who looked at the proposal, > Current code and earlier discussions.... I did review it and corrected it. > A few things might not be entirely accurate, of course. However > The final review report seems plausible. Proposal to take > baby steps and actually **measure** where we are today and whether > those baby steps will be enough to improve your experience seems > to be very good. So ... let me hand it off to what my agent drafted: > > > Thanks for picking this up again -- the redrafted version is much more > focused than the previous attempt and I want to engage with it > constructively. A few code-grounded points below, several of which > reinforce what Jens raised in his reply yesterday. > > > ** On Jens's "simple approach" -- it already exists ** > Jens described what is essentially the outer loop currently in > `_executable_task_instances_to_queued` on `main`. I think it's worth > surfacing this in the AIP itself, because two of the building blocks > the proposal recommends are already partially in production -- which > changes the baseline the AIP has to beat. > > In the code today: > > - `for loop_count in itertools.count(start=1)` at > airflow-core/src/airflow/jobs/scheduler_job_runner.py:542 > - four "starved" sets accumulating constraints discovered in the > Python pass (:526-538, populated in the per-TI loop at :692-858) > - fed back as NOT IN filters on the next iteration (:581-593) > - exit at :870-880 when either a TI was queued, the SQL returned > fewer rows than asked for (no more candidates), or no new filter > was learned. > > Alongside that, a window function caps `max_active_tasks_per_dag` at > SQL level (:598-612) -- `row_number() OVER (PARTITION BY dag_id, run_id > ORDER BY -priority_weight, ...)` filtered by `row_num <= > dr_max_active_tasks`. A 100k-mapped-task run with `max_active_tasks=8` > contributes at most 8 rows to the batch, not 100k. So the "scheduler > fixates on one constrained group" pathology the AIP attributes to the > current design is already mitigated for that dimension. > > What is *not* yet pushed into SQL: pool slots, executor slots, > `max_active_tis_per_dag`, `max_active_tis_per_dagrun`. The natural > incremental next step is to extend the same windowing / pre-filter > pattern to each of those -- which is exactly what PR #54103 does for > the pool dimension. Each limit moved from the Python loop (:689-858) > into SQL shrinks the batch the scheduler has to walk. > > > ** Which means: benchmarks need to be against latest `main` ** > Vikram and Jens both flagged the missing "hard facts." I would frame it > more specifically: the document is comparing against an old > characterization of the scheduler. A workload like Jens's pattern (4) > (medium-complex DAG with a 50-100k-task mapped task group) needs to be > benchmarked against current `main` -- with `max_active_tasks` capping > and the feedback loop active -- and ideally with the further > incremental SQL-side filters from #54103-style follow-ups, before the > case for scoring-formula redesign is convincing. The starvation gap is > shrinking iteration by iteration; the AIP needs to show what's left. > > > ** Starvation is a query-shape problem, not a ranking problem ** > If the loaded batch is 1024 TIs from a DAG already at some Python-side > limit (executor slots, `max_active_tis_per_dagrun`, etc.), no scoring > formula recovers throughput -- only SQL-level filtering does. Aging / > HRRN / urgency only changes which TI wins a contested slot *once* > candidates are loaded. So the recommended direction does not obviously > dominate the dismissed SQL-shape alternatives -- PR #54103 (single- > limit pre-filter, the natural next windowing step) and PR #55537 (1.7- > 2x via SQL procedures) deserve to be the baseline the new design has to > beat, not a footnote. > > > ** The scoring formula's inputs need granularity clarification ** > `deadline` is actually first-class as of Airflow 3.0, via > `DAG(deadline=DeadlineAlert(...))` > (airflow-core/src/airflow/models/deadline.py, models/deadline_alert.py) > with references `DAGRUN_LOGICAL_DATE`, `DAGRUN_QUEUED_AT`, > `AVERAGE_RUNTIME(...)`, `FIXED_DATETIME(...)`. But the existing > mechanism is *DagRun-level* and fires a *callback on miss* via the > scheduler's per-loop check at scheduler_job_runner.py:1674-1693. > TaskInstance has no deadline column. The AIP's per-task > `Wu/(deadline-now-service)` term therefore needs either (a) a way to > derive a per-TI deadline from a DagRun-level `DeadlineAlert` -- non- > trivial for interval-based alerts, ambiguous when a DAG carries > multiple alerts -- or (b) a new per-TI deadline column with its own > user-facing API. `service_time` similarly overlaps with the existing > `AVERAGE_RUNTIME` reference's averaging machinery, but again at DagRun > granularity. Worth spelling out in the AIP which path is intended; > conflating "deadline as miss-callback trigger" with "deadline as > scheduling-urgency signal" is a substantive semantic shift. > > Separately: HRRN is a preemptive-scheduler heuristic. Airflow's > scheduler is non-preemptive -- once a TI is queued, it runs. The queue- > selection window is short relative to task runtimes, so the value of > aging is mostly cosmetic. > > ** HA / critical-section concern ** > > The critical section runs under `pg_try_advisory_xact_lock` (:499) and > `with_row_locks(..., skip_locked=True)` (:642). Replacing > `order_by(-TI.priority_weight, ...)` with a computed expression over > multiple columns risks lengthening the held critical section and > hurting HA throughput. Worth measuring before committing to in-SQL > scoring. > > > ** On terminology -- +1 to Jens ** > In Airflow's state model "scheduled" already means "predecessors done, > eligible to be queued", and the function the AIP is rewriting is > `_executable_task_instances_to_queued`. Renaming "scheduling" to > "queueing" throughout the document would remove a lot of avoidable > confusion -- I had to re-read several sections to keep the terms > straight. > > > ** HITL is separable ** > This is the load-bearing observation behind Jens's `UPDATE > task_instance SET priority_weight = priority_weight + 100 WHERE ...` > question on the wiki: the mechanism for HITL priority override > basically exists today. > > - `priority_weight` is a real column (taskinstance.py:544). > - The scheduler reads it directly when ranking > (scheduler_job_runner.py:575 / :603). > - So the SQL UPDATE on a scheduled TI does change which TI wins the > next pool slot. > - Caveat: `refresh_from_task()` (taskinstance.py:892) can recompute > it on DAG re-parse / map expansion, so it's racy. > - No REST/UI surface (`PatchTaskInstanceBody` only accepts > `new_state` and `note`). > > So a supported HITL override is a two-file change -- a non-clobbered > `priority_override` column + a REST endpoint + > `order_by(-coalesce(TI.priority_override, TI.priority_weight), ...)` > -- that does not need any of the multi-factor scoring machinery to > exist. I would split it out of AIP-100 into its own small AIP, since > it can ship today and the scoring redesign is contingent on the > benchmark question above. > > > ** Suggested split ** > In the spirit of Jens's "requirements are entry conditions, not > results" framing -- the AIP currently bundles three different products > with different blast radii: > > (a) Extending the existing windowing / feedback-loop pattern to the > remaining limits (pool slots, executor slots, > `max_active_tis_per_dag`, `max_active_tis_per_dagrun`). > Incremental, low risk, builds on what works, addresses the > actual starvation cases. PR #54103 is the template. > > (b) HITL priority override -- separable as above. > > (c) Multi-factor scoring (HRRN + deadline + service_time) -- > speculative until (a) and (b) demonstrate it's still necessary, > and contingent on resolving the granularity question for > `deadline` and `service_time`. > > Approving the bundle smuggles a verdict on each. Splitting lets (a) > and (b) ship while (c) gets the benchmark and granularity discussion > it needs. > > * On the 1-2h call Jens proposed * > > I think this would be productive -- if you and Jens want to organise > it, I would try to make it. Anchoring the agenda on "what's the gap > between current `main` (with windowing + feedback loop + PR > #54103-style incremental extensions) and pattern (4) workloads?" > would resolve a lot of the open questions. > > Best, > Jarek > > --- > Drafted-by: Claude Code (Opus 4.7); reviewed by Jarek before sending > > On Thu, May 14, 2026 at 11:34 PM Jens Scheffler <[email protected]> > wrote: > > > Hi, > > > > sorry after some time re-visited the document and must say... is a long > > read in a large complexity. Needed to read parts multiple times. Maybe > > it is excessively hard because over time it fragmented the chapters and > > approaches a bit. Maybe also due to this I am not really sold and > > similar like Vikram missing some "hard facts" and requirements. Yes I > > know we have limits and also in our environments we see sometimes > > starvations... but I am not sure if the proposed solution will fix it. > > Sometimes I feel like starvation is also on DB level or other limits? I > > am not sure. > > > > When I take a look to our production then where scheduler today also > > "starves" and is locked for a long time are a lot of other tasks which > > are unrelated to queuing, like finding orphaned tasks, resetting them > > because of missing heartbeats or so. Applying Dag changes becuase of Dag > > version changes. In our production I mostly feel like if the scheduler > > would focus on scheduling tasks then it would get things done. I > > somethings thought making two instances "just" for scheduling Dag runs, > > two only for calculating if tasks are ready to get to scheduled state > > and keep some instances just for the queue processing. Would be even > > great to see where most time is spend. Also thought (if I would have > > time) adding more metrics to see more details where time is wasted. > > > > Anyway getting back to proposal: I am not sure. I think the write-up > > needs a bit of re-structuring. > > > > Reason: Assuming we can not have a "quick fix" with the current logic > > and before nailing down a new target logic I think we need to re-assess > > the requirements first. This is not really elaborated but some chapters > > make implict assumptions. E.g. we drop task level priorities. Can we? > > Such things are critical entry conditions to plan a target logic and > > optimize it. > > > > Similar with the HITL optimizing individial exceptions. This would be an > > new, cool additional requirement as game-changer which potentially > > influences the solution space. > > > > When updating the document I'd also propose to change all terms from > > "scheduling" in the document to "queueing". Took me another time a lot > > of confusion because in the state model of Airflow the state "scheduled" > > is used to mark that all predecessors are in the right state that a task > > can be queued. For me this is "scheduling". This is also a complex > > calculation but explicit is not scope of the paper. It is only talking > > about the step where tasks are put to "queued" state for Executor to > > pick-up. So "making tasks running/executing". > > > > Then I also have a bit of a problem with the elaboration about defining > > priority on Dag level. Whereas I assume that most use cases are OK to > > prioritize Dags and not individual tasks (but need to check if this is a > > requirement or not!) I am not sure how this would make the logic of > > scheduling easier. At the end entities of tasks need to be scheduled. > > Not Dags. If there is an algorithm that is a candidate then some details > > are missing to understand if this is an alternative. Otherwise the > > option does not be really considered. > > > > Also as I commented earlier I would like to understand which Dag and > > task volumes are considered. We know about the todays standard limit of > > 1024 in Dynamic Mapped tasks. With the change applied can this then be > > safely increased to 10k? 100k? How many "tasks per hour" do we assume to > > get scheduled? Will there be a metric of "tasks dropped out of > > scheduling" because of limits that we can see today and compare to > > future logic? So which factor is assumed to improve? How much CPU is > > needed for the N tasks per hour? Will logic scale linear that if I > > increase to N scheduler instances will it run N-times speed? Or only > > log(N)? > > > > I am not sure about other environments and if the logic helps in all use > > cases but I assume from our site I think we have 4 major different Dag > > layout patterns. Will the logic still be OK and improve all of them or > > do we assume a drawback for the benefit in scaling on the other side? > > > > 1. Simple Dag with 1-3 tasks, but thousands of runs scheduled per hour, > > ~1000 runs concurrently (no problem today actually, see no > starvation) > > 2. Simple Dag but Mapped Task with today 1024 -> Target 100k tasks in a > > run. ~10 runs concurrently, maybe more queued (noproblem today up to > > <~2000 mapped tasks) > > 3. Complex Dag with 10-500 tasks in a complex graph of dependency. Not > > a large volume of mapped tasks. Many parallel runs (our Dags are not > > too complex, other feedback welcome) > > 4. Medium complex Dag, 10-50 tasks in a Mapped Task Group. Mapping > > desired to be 1000+ so in total 50-100k tasks in a single run. Maybe > > 10 runs (this is a severe problem when mapping is >100 already!) > > > > In case (2) and (4) we also saw failures in OOM as the scheduler > > attempted to load all tasks in memory. Not sure if this was a problem of > > queuing or maybe even before in getting to scheduled. Would therefore be > > good to clarify which cases are potential to improve. > > > > Some final comment: Have you also considered a simple approach that in > > the described "starvation" when the dropout rate is high (e.g. 90%) use > > the dropped Dags and Tasks / Pools that lead to dropout as additional > > select filter and query another round of tasks via `max_tis_per_query` > > and repeat until the dropout is below a threshold? That might extend the > > time for a scheduler loop but due to applied incremental filter improve > > effectiveness of tasks that are scheduled. As well as maximizes the net > > time scheduler spends on task queing compared to other things where net > > scheduler time is spend on if queuing is the limiting factor. > > > > While my write-up should not be treated as a rejection I think the > > AIP-100 needs to be a bit re-sorted to be understandable and convincing. > > So far I am not clear if the proposed solution is fixing the root. > > > > Not sure if there are other experts with other (general) opinions but > > not having this paper starving in review I could tink of having a 1-2h > > call to discuss in-person to sort things. I could offer to join to give > > some guidance if not others are on this already. > > > > Jens > > > > On 14.05.26 18:28, Jarek Potiuk wrote: > > > Hi Vikram, > > > > > > Natanel has reworked the approach, and it is receiving a much warmer > > > reception than before. This is Natanel's restart attempt, and there has > > > been no other discussion about it. While the "why" remains the > > > same—addressing starvation issues experienced by various users > (including > > > Jens' team)—the previous proposals were rejected by Ash, Jens, and me > due > > > to excessive complexity, such as building new algorithms via stored > > > procedures and such. > > > > > > The current proposal is simpler and introduces an interesting way to > > > prioritize SLA callbacks - whichis an interesting feature to have. > While > > it > > > requires further scrutiny, it appears to be a direction we could > > > potentially accept. > > > > > > Best, > > > Jarek > > > > > > On Thu, May 14, 2026 at 4:50 PM Vikram Koka via dev< > > [email protected]> > > > wrote: > > > > > >> I probably missed the updates here, but when this was brought up at > the > > dev > > >> call a while ago, I thought the response was a firm "No". > > >> Did I miss the follow-on updates? > > >> > > >> The discussion centered on the "why". > > >> This is such a "power user" feature, which is fine, but it lacks > > >> articulation of the projected benefit in quantifiable terms. > > >> > > >> It has very high technical complexity in the core of Airflow, which > > could > > >> cause massive ripple effects and therefore requires a cautious > approach. > > >> > > >> I left a comment in the AIP document as well, about needing to > > understand > > >> the "why" better. > > >> > > >> Vikram > > >> > > >> > > >> > > >> On Thu, May 14, 2026 at 1:28 AM Christos Bisias<[email protected] > > > > >> wrote: > > >> > > >>> Hello, > > >>> > > >>> I've got a question but I can't find a way to add a comment under the > > >> AIP. > > >>> The 'Weighted Aging and SLA Urgency' solves the starvation issue but > do > > >> we > > >>> have an idea of what happens with the scheduler's performance? > > >>> > > >>> Christos > > >>> > > >>> On Thu, May 14, 2026 at 7:47 AM Elad Kalif<[email protected]> > wrote: > > >>> > > >>>> Kind reminder to everyone who still wants to add comments > > >>>> If no further comments i think we can move this to a vote. > > >>>> > > >>>> On Tue, May 5, 2026 at 12:00 AM Jarek Potiuk<[email protected]> > wrote: > > >>>> > > >>>>> Very interesting - also will take a look shortly - and great how it > > >>> seems > > >>>>> to tap into SLA urgency indeed.. > > >>>>> > > >>>>> On Mon, May 4, 2026 at 10:08 PM Przemysław Mirowski < > > >> [email protected] > > >>>>> wrote: > > >>>>> > > >>>>>> +1 for Weighted Aging and SLA Urgency proposition. > > >>>>>> ________________________________ > > >>>>>> From: Elad Kalif<[email protected]> > > >>>>>> Sent: 30 April 2026 12:08 > > >>>>>> To:[email protected] <[email protected]> > > >>>>>> Subject: Re: [DISCUSS] AIP-100: Eliminate Scheduler Starvation > > >>>>>> > > >>>>>> I love the idea of dynamic priorities and I think this is a good > > >>>>> direction. > > >>>>>> On Thu, Apr 30, 2026 at 12:58 PM Ash Berlin-Taylor < > [email protected] > > >>>>> wrote: > > >>>>>>> Thanks, this re-drafted version looks interesting. I’m trying to > > >>>>>>> internalise and understand the new proposal. I’l leave a few > > >>>>>>> comments/questions on the confluence page as I go. > > >>>>>>> > > >>>>>>> (I have to say, the move away from epic stored procedure def is a > > >>>>> welcome > > >>>>>>> change!) > > >>>>>>> > > >>>>>>> -ash > > >>>>>>> > > >>>>>>>> On 22 Apr 2026, at 14:45, Natanel<[email protected]> > > >>> wrote: > > >>>>>>>> Hello community, I want to spark again a discussion that was > > >> held > > >>>> in > > >>>>>> the > > >>>>>>>> past and delayed (due to the focus on releasing airflow 3.2), > > >> now > > >>>>> that > > >>>>>>> 3.2 > > >>>>>>>> is released, I think it might be a good Idea to bring up the > > >>>>> discussion > > >>>>>>>> again. > > >>>>>>>> > > >>>>>>>> Wiki: > > >>>>>>>> > > >> > > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-100+Eliminate+Scheduler+Starvation+On+Concurrency+Limits > > >>>>>>>> In the current situation, tasks may starve in airflow, and in > > >>> large > > >>>>>> scale > > >>>>>>>> deployments (hundreds of thousands of tasks (or more) per day) > > >> we > > >>>>> tend > > >>>>>> to > > >>>>>>>> experience severe starvation, where a group of tasks may starve > > >>>> other > > >>>>>>>> tasks, not allowing them to run, as described in the wiki. > > >>>>>>>> > > >>>>>>>> After the february devcall, where I have proposed the AIP, a > > >> few > > >>>>>> comments > > >>>>>>>> have arised, and so I had begun to research again about > > >> different > > >>>>>>>> scheduling algorithms and I have added to the considerations as > > >>>> part > > >>>>> of > > >>>>>>> the > > >>>>>>>> AIP. > > >>>>>>>> > > >>>>>>>> As of now, the state of the AIP is where there are a few ideas > > >>>>> proposed > > >>>>>>>> (some of which are pretty similar to each other, while others > > >> are > > >>>>> quite > > >>>>>>>> different), as the main concern from the devcall was that the > > >>>>>> approaches > > >>>>>>>> given might not be the best way to solve the issue, as it is a > > >>> very > > >>>>>> hard > > >>>>>>>> problem to solve. > > >>>>>>>> > > >>>>>>>> After that, I have made some edits to the AIP and to the > > >>>>> propositions, > > >>>>>> in > > >>>>>>>> order to help decide and clarify the advantages and > > >> disadvantages > > >>>> of > > >>>>>> each > > >>>>>>>> approach. > > >>>>>>>> The current "best approach" can be found here here > > >>>>>>>> < > > >> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618462#AIP100EliminateSchedulerStarvationOnConcurrencyLimits-Currentbestproposition > > >>>>>>>> , > > >>>>>>>> where the new proposed algorithms are defined here > > >>>>>>>> < > > >> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618462#AIP100EliminateSchedulerStarvationOnConcurrencyLimits-Othernonagingalgorithmsother_algs > > >>>>>>>> . > > >>>>>>>> > > >>>>>>>> In order to continue with the effort, a community consensus > > >> needs > > >>>> to > > >>>>> be > > >>>>>>>> reached about the preferred solution/solutions, once this is > > >>> done, > > >>>> it > > >>>>>> is > > >>>>>>>> possible to go on and implement + stress test the proposed > > >>>>> solution/s. > > >>>>>>>> I would appreciate a review from community members, moreover, I > > >>>> would > > >>>>>>> also > > >>>>>>>> appreciate any new propositions or improvements which can be > > >>> done. > > >>>>>>>> Best regards > > >>>>>>>> Natanel. > > >>>>>>> > > >>>>>>> > > >>> --------------------------------------------------------------------- > > >>>>>>> To unsubscribe, e-mail:[email protected] > > >>>>>>> For additional commands, e-mail:[email protected] > > >>>>>>> > > >>>>>>> >
