Hi,

sorry after some time re-visited the document and must say... is a long read in a large complexity. Needed to read parts multiple times. Maybe it is excessively hard because over time it fragmented the chapters and approaches a bit. Maybe also due to this I am not really sold and similar like Vikram missing some "hard facts" and requirements. Yes I know we have limits and also in our environments we see sometimes starvations... but I am not sure if the proposed solution will fix it. Sometimes I feel like starvation is also on DB level or other limits? I am not sure.

When I take a look to our production then where scheduler today also "starves" and is locked for a long time are a lot of other tasks  which are unrelated to queuing, like finding orphaned tasks, resetting them because of missing heartbeats or so. Applying Dag changes becuase of Dag version changes. In our production I mostly feel like if the scheduler would focus on scheduling tasks then it would get things done. I somethings thought making two instances "just" for scheduling Dag runs, two only for calculating if tasks are ready to get to scheduled state and keep some instances just for the queue processing. Would be even great to see where most time is spend. Also thought (if I would have time) adding more metrics to see more details where time is wasted.

Anyway getting back to proposal: I am not sure. I think the write-up needs a bit of re-structuring.

Reason: Assuming we can not have a "quick fix" with the current logic and before nailing down a new target logic I think we need to re-assess the requirements first. This is not really elaborated but some chapters make implict assumptions. E.g. we drop task level priorities. Can we? Such things are critical entry conditions to plan a target logic and optimize it.

Similar with the HITL optimizing individial exceptions. This would be an new, cool additional requirement as game-changer which potentially influences the solution space.

When updating the document I'd also propose to change all terms from "scheduling" in the document to "queueing". Took me another time a lot of confusion because in the state model of Airflow the state "scheduled" is used to mark that all predecessors are in the right state that a task can be queued. For me this is "scheduling". This is also a complex calculation but explicit is not scope of the paper. It is only talking about the step where tasks are put to "queued" state for Executor to pick-up. So "making tasks running/executing".

Then I also have a bit of a problem with the elaboration about defining priority on Dag level. Whereas I assume that most use cases are OK to prioritize Dags and not individual tasks (but need to check if this is a requirement or not!) I am not sure how this would make the logic of scheduling easier. At the end entities of tasks need to be scheduled. Not Dags. If there is an algorithm that is a candidate then some details are missing to understand if this is an alternative. Otherwise the option does not be really considered.

Also as I commented earlier I would like to understand which Dag and task volumes are considered. We know about the todays standard limit of 1024 in Dynamic Mapped tasks. With the change applied can this then be safely increased to 10k? 100k? How many "tasks per hour" do we assume to get scheduled? Will there be a metric of "tasks dropped out of scheduling" because of limits that we can see today and compare to future logic? So which factor is assumed to improve? How much CPU is needed for the N tasks per hour? Will logic scale linear that if I increase to N scheduler instances will it run N-times speed? Or only log(N)?

I am not sure about other environments and if the logic helps in all use cases but I assume from our site I think we have 4 major different Dag layout patterns. Will the logic still be OK and improve all of them or do we assume a drawback for the benefit in scaling on the other side?

1. Simple Dag with 1-3 tasks, but thousands of runs scheduled per hour,
   ~1000 runs concurrently (no problem today actually, see no starvation)
2. Simple Dag but Mapped Task with today 1024 -> Target 100k tasks in a
   run. ~10 runs concurrently, maybe more queued (noproblem today up to
   <~2000 mapped tasks)
3. Complex Dag with 10-500 tasks in a complex graph of dependency. Not
   a large volume of mapped tasks. Many parallel runs (our Dags are not
   too complex, other feedback welcome)
4. Medium complex Dag, 10-50 tasks in a Mapped Task Group. Mapping
   desired to be 1000+ so in total 50-100k tasks in a single run. Maybe
   10 runs (this is a severe problem when mapping is >100 already!)

In case (2) and (4) we also saw failures in OOM as the scheduler attempted to load all tasks in memory. Not sure if this was a problem of queuing or maybe even before in getting to scheduled. Would therefore be good to clarify which cases are potential to improve.

Some final comment: Have you also considered a simple approach that in the described "starvation" when the dropout rate is high (e.g. 90%) use the dropped Dags and Tasks / Pools that lead to dropout as additional select filter and query another round of tasks via `max_tis_per_query` and repeat until the dropout is below a threshold? That might extend the time for a scheduler loop but due to applied incremental filter improve effectiveness of tasks that are scheduled. As well as maximizes the net time scheduler spends on task queing compared to other things where net scheduler time is spend on if queuing is the limiting factor.

While my write-up should not be treated as a rejection I think the AIP-100 needs to be a bit re-sorted to be understandable and convincing. So far I am not clear if the proposed solution is fixing the root.

Not sure if there are other experts with other (general) opinions but not having this paper starving in review I could tink of having a 1-2h call to discuss in-person to sort things. I could offer to join to give some guidance if not others are on this already.

Jens

On 14.05.26 18:28, Jarek Potiuk wrote:
Hi Vikram,

Natanel has reworked the approach, and it is receiving a much warmer
reception than before. This is Natanel's restart attempt, and there has
been no other discussion about it. While the "why" remains the
same—addressing starvation issues experienced by various users (including
Jens' team)—the previous proposals were rejected by Ash, Jens, and me due
to excessive complexity, such as building new algorithms via stored
procedures and such.

The current proposal is simpler and introduces an interesting way to
prioritize SLA callbacks - whichis an interesting feature to have. While it
requires further scrutiny, it appears to be a direction we could
potentially accept.

Best,
Jarek

On Thu, May 14, 2026 at 4:50 PM Vikram Koka via dev<[email protected]>
wrote:

I probably missed the updates here, but when this was brought up at the dev
call a while ago, I thought the response was a firm "No".
Did I miss the follow-on updates?

The discussion centered on the "why".
This is such a "power user" feature, which is fine, but it lacks
articulation of the projected benefit in quantifiable terms.

It has very high technical complexity in the core of Airflow, which could
cause massive ripple effects and therefore requires a cautious approach.

I left a comment in the AIP document as well, about needing to understand
the "why" better.

Vikram



On Thu, May 14, 2026 at 1:28 AM Christos Bisias<[email protected]>
wrote:

Hello,

I've got a question but I can't find a way to add a comment under the
AIP.
The 'Weighted Aging and SLA Urgency' solves the starvation issue but do
we
have an idea of what happens with the scheduler's performance?

Christos

On Thu, May 14, 2026 at 7:47 AM Elad Kalif<[email protected]> wrote:

Kind reminder to everyone who still wants to add comments
If no further comments i think we can move this to a vote.

On Tue, May 5, 2026 at 12:00 AM Jarek Potiuk<[email protected]> wrote:

Very interesting - also will take a look shortly - and great how it
seems
to tap into SLA urgency indeed..

On Mon, May 4, 2026 at 10:08 PM Przemysław Mirowski <
[email protected]
wrote:

+1 for Weighted Aging and SLA Urgency proposition.
________________________________
From: Elad Kalif<[email protected]>
Sent: 30 April 2026 12:08
To:[email protected] <[email protected]>
Subject: Re: [DISCUSS] AIP-100: Eliminate Scheduler Starvation

I love the idea of dynamic priorities and I think this is a good
direction.
On Thu, Apr 30, 2026 at 12:58 PM Ash Berlin-Taylor <[email protected]
wrote:
Thanks, this re-drafted version looks interesting. I’m trying to
internalise and understand the new proposal. I’l leave a few
comments/questions on the confluence page as I go.

(I have to say, the move away from epic stored procedure def is a
welcome
change!)

-ash

On 22 Apr 2026, at 14:45, Natanel<[email protected]>
wrote:
Hello community, I want to spark again a discussion that was
held
in
the
past and delayed (due to the focus on releasing airflow 3.2),
now
that
3.2
is released, I think it might be a good Idea to bring up the
discussion
again.

Wiki:

https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-100+Eliminate+Scheduler+Starvation+On+Concurrency+Limits
In the current situation, tasks may starve in airflow, and in
large
scale
deployments (hundreds of thousands of tasks (or more) per day)
we
tend
to
experience severe starvation, where a group of tasks may starve
other
tasks, not allowing them to run, as described in the wiki.

After the february devcall, where I have proposed the AIP, a
few
comments
have arised, and so I had begun to research again about
different
scheduling algorithms and I have added to the considerations as
part
of
the
AIP.

As of now, the state of the AIP is where there are a few ideas
proposed
(some of which are pretty similar to each other, while others
are
quite
different), as the main concern from the devcall was that the
approaches
given might not be the best way to solve the issue, as it is a
very
hard
problem to solve.

After that, I have made some edits to the AIP and to the
propositions,
in
order to help decide and clarify the advantages and
disadvantages
of
each
approach.
The current "best approach" can be found here here
<
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618462#AIP100EliminateSchedulerStarvationOnConcurrencyLimits-Currentbestproposition
,
where the new proposed algorithms are defined here
<
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618462#AIP100EliminateSchedulerStarvationOnConcurrencyLimits-Othernonagingalgorithmsother_algs
.

In order to continue with the effort, a community consensus
needs
to
be
reached about the preferred solution/solutions, once this is
done,
it
is
possible to go on and implement + stress test the proposed
solution/s.
I would appreciate a review from community members, moreover, I
would
also
appreciate any new propositions or improvements which can be
done.
Best regards
Natanel.


---------------------------------------------------------------------
To unsubscribe, e-mail:[email protected]
For additional commands, e-mail:[email protected]

Reply via email to