Good point from Stefan. I had also commented on the relation to AIP-97, which I would love to see either land or converge with AIP-105.

In this light, what would be the intent of the retry policy if the worker "dies" in a segfault or loses its heartbeat? Would the standard, existing scheduler-based retry kick in then?

Jens

On 21.04.26 02:19, Stefan Wang wrote:
Thanks Kaxil,
huge +1

This feels like a meaningful step forward.

Giving users a way to express retry intent and putting the policy on the
operator is something we've needed for a while. The current options
aren't great: wrap everything in try/except and raise
AirflowFailException, or live with retries=3 as a blunt instrument.
Both are compromises.

A few things that stand out in the design:

1. I think evaluating on the worker is the right call. Exceptions don't
serialize cleanly across process boundaries, and keeping the decision
close to where the exception actually happens avoids a lot of
indirection. The scheduler-side version would be simpler to ship but
harder to use.

2. The flat rule list is easier to reason about and validate at parse time
than a nested structure would be. Elad's suggestion to let one rule
match multiple exception types would tighten the common case without
losing that.

A couple of thoughts that came up while reading:

1. On Przemysław's testing point: if policy.evaluate() is just a method
you can call with a synthetic exception, DAG authors can cover a lot of
ground in unit tests. Not the same as validating in production, but it
catches a decent amount before deploy.
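To make that testing pattern concrete, here is a minimal sketch. RetryAction, RetryRule, and ExceptionRetryPolicy below are simplified stand-ins for the AIP-105 interface, not the actual implementation; the point is only that evaluate() with a synthetic exception is an ordinary unit test:

```python
# Sketch of unit-testing a retry policy by calling evaluate() directly
# with a synthetic exception. The classes here are simplified stand-ins
# for the AIP-105 interface, not the real Airflow code.
from dataclasses import dataclass
from enum import Enum


class RetryAction(Enum):
    RETRY = "retry"
    FAIL = "fail"
    DEFAULT = "default"


@dataclass
class RetryRule:
    exception: str  # fully qualified exception class name
    action: RetryAction


class ExceptionRetryPolicy:
    def __init__(self, rules):
        self.rules = rules

    def evaluate(self, exc: BaseException) -> RetryAction:
        # Match on the fully qualified name of the raised exception type.
        name = f"{type(exc).__module__}.{type(exc).__qualname__}"
        for rule in self.rules:
            if rule.exception == name:
                return rule.action
        return RetryAction.DEFAULT


# A "unit test" is then just a plain call with a synthetic exception:
policy = ExceptionRetryPolicy(
    rules=[RetryRule(exception="builtins.ValueError", action=RetryAction.FAIL)]
)
assert policy.evaluate(ValueError("bad config")) is RetryAction.FAIL
assert policy.evaluate(KeyError("missing")) is RetryAction.DEFAULT
```

No worker, scheduler, or DAG run is needed, which is what makes this cheap enough to run on every deploy.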

2. On retry budgets more broadly: retries=N today can get consumed by
worker evictions or heartbeat losses before any retry policy ever runs.
Pluggable policies will feel sharper once the user-visible budget
actually reflects user-domain failures. I also have two drafts touching
this area, AIP-96 (Resumable Operators) and AIP-97 (Execution Context +
separate infra retry budget), and will post updates on both soon. Open
to converging where it makes sense.
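As a rough illustration of the separate-budget idea (all names and categories here are hypothetical, not taken from AIP-97), infrastructure failures would charge a generously sized infra budget of their own, leaving retries=N to count only user-domain failures:

```python
# Hypothetical sketch of a split retry budget: infrastructure failures
# (evictions, heartbeat loss) charge their own budget, so the user-visible
# retries=N is consumed only by user-domain failures. Names illustrative.
from dataclasses import dataclass

INFRA_CATEGORIES = {"worker_eviction", "heartbeat_loss", "zombie"}


@dataclass
class RetryBudgets:
    user_retries_left: int   # the user-visible retries=N budget
    infra_retries_left: int  # separate, generously sized infra budget

    def charge(self, failure_category: str) -> bool:
        """Charge the failure to the right budget; return True if a retry
        is still allowed afterwards."""
        if failure_category in INFRA_CATEGORIES:
            if self.infra_retries_left > 0:
                self.infra_retries_left -= 1
                return True
            return False
        if self.user_retries_left > 0:
            self.user_retries_left -= 1
            return True
        return False


budgets = RetryBudgets(user_retries_left=3, infra_retries_left=20)
assert budgets.charge("worker_eviction")  # infra budget absorbs it
assert budgets.user_retries_left == 3     # user budget untouched
assert budgets.charge("http_500")         # user-domain failure
assert budgets.user_retries_left == 2
```

The interesting design question is who assigns the failure category, which is exactly where a retry policy evaluated on the worker would plug in.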
For what it's worth, we've been running two related pieces in production
at LinkedIn. One is a mixin that preserves external jobs (Spark, Flink,
and similar) when the worker gets disrupted instead of cancelling them.
The other is a separate infrastructure retry budget set generously
enough that infrastructure events don't eat into user-visible retries. I
can share anonymized failure-category data from both if it would help
ground the default rule library.

Looking forward to v2.

— Stefan

On Apr 20, 2026, at 1:50 PM, Przemysław Mirowski <[email protected]> wrote:

Great idea! Thanks for proposing it. It will make proper exception-retry
handling much easier than before and will open the door to more
extensibility too.

+1 also to the questions and concerns Elad raised. I'm not sure about
the changes to priority weight (maybe part of AIP-100), but as for point
2, about not having full control over the exceptions raised: looking at
the Airflow ecosystem, with all of the providers and their different
libraries, I think it is something we should consider.

One additional comment: since retry policies will only run on workers (which is pretty
nice from, e.g., a security point of view), I didn't see in the AIP or PR a way to
validate that a configured retry policy will work before the moment it is actually
needed. That can make setting up retry policies harder, and testing them cumbersome.
A nice way (from the Dag author's perspective) to test whether a defined retry policy
will actually work when it is needed would make Dag authors' lives much easier, and
defining these rules easier too (somewhat related: testing Airflow Connections, and
the work on moving "Test Connection" to workers). Of course, LLM-based retry policies
are rather out of scope here, but testing the more deterministic behaviours should be
much easier to do.

________________________________
From: Vincent Beck <[email protected]>
Sent: 20 April 2026 15:17
To: [email protected] <[email protected]>
Subject: Re: [DISCUSS] AIP-105: Pluggable Retry Policies

Makes a lot of sense to me!

On 2026/04/19 13:56:56 Elad Kalif wrote:
Great idea!
Love it!

I have some questions / comments:
1. The current interface suggests rules that contain RetryRule objects,
but I wonder if we should change exception to exceptions and accept a
list:

        rules=[
            RetryRule(
                exceptions=[
                    "requests.exceptions.HTTPError",
                    "google.auth.exceptions.RefreshError",
                ],
                ...,
            ),
        ]

I'm thinking about a case where several exceptions need the same
behaviour and the user may not wish to spell out separate rules for each.

2. Does it make sense to extend the interface to XCom values? I'm thinking
about a case where DAG authors don't have full control over the exception
raised, or where some upstream library changes the exception, which breaks
the retry logic. Maybe we should also offer the option to set retry
behaviour based on the previous attempt's XCom value?

3. Maybe something for the longer run, but still worth discussing: one of
the main motivations for custom weight rules
https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/priority-weight.html#custom-weight-rule
was to set priority based on try number. I wonder if we may want to somehow
combine it with the retry rule. For retries, I can argue that the weight of
the task is a property of the retry instructions, and it may very well be
that the weight should change depending on the exception.

On Sun, Apr 19, 2026 at 6:30 AM Shahar Epstein <[email protected]> wrote:

Great idea! I liked both the deterministic approach as well as the AI
integrated.


Shahar

On Sat, Apr 18, 2026 at 3:02 AM Kaxil Naik <[email protected]> wrote:

Hi all,

Continuing the push to make Airflow AI-native, I have put together
AIP-105:
Pluggable Retry Policies.

Wiki:


https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
PR (core): https://github.com/apache/airflow/pull/65450
PR (LLM-powered, common-ai provider):
https://github.com/apache/airflow/pull/65451

The problem is straightforward: Airflow retries every failure the same
way.
An expired API key gets retried 3 times over 15 minutes. A rate-limited
API
gets retried immediately, hitting the same 429. Users who want smarter
retries today have to wrap every task in try/except and raise
AirflowFailException manually, mixing retry logic into business logic.

This AIP adds a retry_policy parameter to BaseOperator. The policy
evaluates the actual exception at failure time and returns RETRY (with a
custom delay), FAIL (skip remaining retries), or DEFAULT (standard
behaviour). It runs in the worker process, not the scheduler.

Declarative example:

```python
@task(
    retries=5,
    retry_policy=ExceptionRetryPolicy(
        rules=[
            RetryRule(
                exception="requests.exceptions.HTTPError",
                action=RetryAction.RETRY,
                retry_delay=timedelta(minutes=5),
            ),
            RetryRule(
                exception="google.auth.exceptions.RefreshError",
                action=RetryAction.FAIL,
            ),
        ]
    ),
)
def call_api():
    ...
```

LLM-powered example -- uses any pydantic-ai provider (OpenAI, Anthropic,
Bedrock, Ollama):

```python
@task(retries=5, retry_policy=LLMRetryPolicy(llm_conn_id="my_llm"))
def call_flaky_api(): ...
```

The LLM version classifies errors into categories (auth, rate_limit,
network, data, transient, permanent) using structured output with a
30-second timeout, plus declarative fallback rules for when the LLM
itself is down.
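For illustration, the deterministic fallback path could look roughly like this; the category-to-action mapping and the patterns below are illustrative sketches, not the code in the PR:

```python
# Illustrative sketch of the deterministic fallback path: when the LLM
# call times out, fall back to rule-based classification into the same
# categories. The mapping and patterns here are examples, not real code.
import re

# Category -> action the policy would take (illustrative).
CATEGORY_ACTIONS = {
    "auth": "FAIL",         # expired keys won't fix themselves
    "rate_limit": "RETRY",  # back off and try again
    "network": "RETRY",
    "data": "FAIL",
    "transient": "RETRY",
    "permanent": "FAIL",
}

# Ordered fallback rules: first matching pattern wins.
FALLBACK_RULES = [
    (re.compile(r"401|403|unauthorized|expired", re.I), "auth"),
    (re.compile(r"429|rate.?limit", re.I), "rate_limit"),
    (re.compile(r"timeout|connection|dns", re.I), "network"),
]


def classify_fallback(error_message: str) -> str:
    """Classify an error message without the LLM."""
    for pattern, category in FALLBACK_RULES:
        if pattern.search(error_message):
            return category
    return "transient"  # optimistic default when nothing matches


assert classify_fallback("HTTP 429 Too Many Requests") == "rate_limit"
assert CATEGORY_ACTIONS[classify_fallback("token expired")] == "FAIL"
```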

I have attached demo videos and screenshots to both PRs showing the
policies running end-to-end in Airflow -- including the LLM correctly
classifying 4 different error types via Claude Haiku.

Full design, done criteria, and implementation details are in the wiki
page
above.

Feedback welcome.

Thanks,
Kaxil

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


