Thanks all. Given the general consensus around the use case and high-level
implementation, I am going to start the vote.

On Wed, 22 Apr 2026 at 18:36, Blain David <[email protected]> wrote:

> I really like this direction — it’s something I’ve been thinking about as
> well, although from a slightly different angle.
>
> I’ve been considering starting a discussion around making retry behavior
> more dynamic based on runtime context, rather than introducing AI
> specifically. The current model is quite static: we retry blindly based on
> configuration, without considering why the failure happened or what the
> system state looks like at that moment.
>
> What I find compelling in this AIP is the shift toward failure-aware
> retries. That aligns closely with the idea of making DAGs more resilient —
> not just retrying in the hope of eventual success, but making a more
> informed decision based on the nature of the failure.
>
> One thing I’d be interested in exploring further is how far we can push
> this in a deterministic/runtime-driven way (e.g. exception type, response
> metadata, external signals like rate limits or downstream system health),
> and how that compares to or complements the LLM-based approach.
>
> Overall, this feels like a strong step toward decoupling retry logic from
> business logic, which is definitely a gap today.
>
> Very nice proposal, Kaxil, so definitely +1 for me.
>
>
> ________________________________
> From: Kaxil Naik <[email protected]>
> Sent: Saturday, April 18, 2026 02:01
> To: [email protected] <[email protected]>
> Subject: [DISCUSS] AIP-105: Pluggable Retry Policies
>
> Hi all,
>
> Continuing the push to make Airflow AI-native, I have put together AIP-105:
> Pluggable Retry Policies.
>
> Wiki:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
> PR (core): https://github.com/apache/airflow/pull/65450
> PR (LLM-powered, common-ai provider):
> https://github.com/apache/airflow/pull/65451
>
> The problem is straightforward: Airflow retries every failure the same way.
> An expired API key gets retried 3 times over 15 minutes. A rate-limited API
> gets retried immediately, hitting the same 429. Users who want smarter
> retries today have to wrap every task in try/except and raise
> AirflowFailException manually, mixing retry logic into business logic.
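For readers who have not run into this, the status quo looks roughly like the sketch below. The exception classes here are self-contained stand-ins (in real DAG code they would be `airflow.exceptions.AirflowFailException` and `requests.exceptions.HTTPError`); the point is how the retry decision gets tangled into the business logic.

```python
# Sketch of today's workaround, with stand-in exception classes so the
# snippet runs on its own. In real DAG code these would be
# airflow.exceptions.AirflowFailException and requests.exceptions.HTTPError.
class AirflowFailException(Exception):
    """Stand-in: raising this tells Airflow to skip remaining retries."""

class HTTPError(Exception):
    def __init__(self, message, status_code):
        super().__init__(message)
        self.status_code = status_code

def call_api(do_request):
    try:
        return do_request()
    except HTTPError as e:
        if e.status_code in (401, 403):
            # Auth failures will never succeed on retry: fail fast.
            raise AirflowFailException("auth error, skipping retries") from e
        raise  # anything else falls through to Airflow's normal retry loop
```

Every task that wants this behaviour has to repeat the wrapper, which is exactly the duplication the proposal removes.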
>
> This AIP adds a retry_policy parameter to BaseOperator. The policy
> evaluates the actual exception at failure time and returns RETRY (with a
> custom delay), FAIL (skip remaining retries), or DEFAULT (standard
> behaviour). It runs in the worker process, not the scheduler.
>
> Declarative example:
>
> ```python
> @task(
>     retries=5,
>     retry_policy=ExceptionRetryPolicy(
>         rules=[
>             RetryRule(
>                 exception="requests.exceptions.HTTPError",
>                 action=RetryAction.RETRY,
>                 retry_delay=timedelta(minutes=5),
>             ),
>             RetryRule(
>                 exception="google.auth.exceptions.RefreshError",
>                 action=RetryAction.FAIL,
>             ),
>         ]
>     ),
> )
> def call_api():
>     ...
> ```
>
> LLM-powered example -- uses any pydantic-ai provider (OpenAI, Anthropic,
> Bedrock, Ollama):
>
> ```python
> @task(retries=5, retry_policy=LLMRetryPolicy(llm_conn_id="my_llm"))
> def call_flaky_api(): ...
> ```
>
> The LLM version classifies errors into categories (auth, rate_limit,
> network, data, transient, permanent) using structured output with a
> 30-second timeout and declarative fallback rules for when the LLM itself is
> down.
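A rough sketch of that classify-with-fallback flow, as I understand the description. This is illustrative only: `llm_call` is a stand-in for the pydantic-ai structured-output call, the fallback-rule shape is my assumption, and the real implementation enforces the 30-second timeout around the LLM call.

```python
# Illustrative sketch of the classify-with-fallback flow described above.
# `llm_call` stands in for the pydantic-ai structured-output call; the
# real implementation also enforces a 30-second timeout around it.
CATEGORIES = {"auth", "rate_limit", "network", "data", "transient", "permanent"}

# Declarative fallback rules (shape assumed), used when the LLM is down.
FALLBACK_RULES = {
    "PermissionError": "auth",
    "ConnectionError": "network",
    "TimeoutError": "network",
}

def classify(exc, llm_call, default="transient"):
    """Return a failure category; fall back to the static rules when the
    LLM call fails or returns something outside the known categories."""
    try:
        category = llm_call(repr(exc))
        if category in CATEGORIES:
            return category
    except Exception:
        pass  # LLM down or timed out: fall through to the static rules
    return FALLBACK_RULES.get(type(exc).__name__, default)
```

When the LLM answers with a known category, that answer wins; otherwise the declarative rules decide by exception type, with a conservative default.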
>
> I have attached demo videos and screenshots to both PRs showing both
> policies running end-to-end in Airflow -- including the LLM correctly
> classifying 4 different error types via Claude Haiku.
>
> Full design, done criteria, and implementation details are in the wiki page
> above.
>
> Feedback welcome.
>
> Thanks,
> Kaxil
>
