A bit late to the party, but very well drafted -- identifying a real problem
and authoring an AIP very minimalistically and to the point.


Thanks & Regards,
Amogh Desai


On Thu, Apr 23, 2026 at 1:57 AM Kaxil Naik <[email protected]> wrote:

> Thanks all, given the general consensus around the use-case and high-level
> implementation, I am going to start the vote
>
> On Wed, 22 Apr 2026 at 18:36, Blain David <[email protected]> wrote:
>
> > I really like this direction — it’s something I’ve been thinking about as
> > well, although from a slightly different angle.
> > I’ve been considering starting a discussion around making retry behavior
> > more dynamic based on runtime context, rather than introducing AI
> > specifically. The current model is quite static: we retry blindly based
> on
> > configuration, without considering why the failure happened or what the
> > system state looks like at that moment.
> > What I find compelling in this AIP is the shift toward failure-aware
> > retries. That aligns closely with the idea of making DAGs more resilient
> —
> > not just retrying in the hope of eventual success, but making a more
> > informed decision based on the nature of the failure.
> > One thing I’d be interested in exploring further is how far we can push
> > this in a deterministic/runtime-driven way (e.g. exception type, response
> > metadata, external signals like rate limits or downstream system health),
> > and how that compares to or complements the LLM-based approach.
> > Overall, this feels like a strong step toward decoupling retry logic from
> > business logic, which is definitely a gap today.
> > Very nice proposal Kaxil, so definitely +1 for me.
> >
> >
> > ________________________________
> > From: Kaxil Naik <[email protected]>
> > Sent: Saturday, April 18, 2026 02:01
> > To: [email protected] <[email protected]>
> > Subject: [DISCUSS] AIP-105: Pluggable Retry Policies
> >
> > EXTERNAL MAIL: Indien je de afzender van deze e-mail niet kent en deze
> > niet vertrouwt, klik niet op een link of open geen bijlages. Bij twijfel,
> > stuur deze e-mail als bijlage naar [email protected]<mailto:
> > [email protected]>.
> >
> > Hi all,
> >
> > Continuing the push to make Airflow AI-native, I have put together
> AIP-105:
> > Pluggable Retry Policies.
> >
> > Wiki:
> >
> >
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FAIRFLOW%2FAIP-105%253A%2BPluggable%2BRetry%2BPolicies&data=05%7C02%7Cdavid.blain%40infrabel.be%7C08fa091c1ec64b36829b08de9cddcc96%7Cb82bc314ab8e4d6fb18946f02e1f27f2%7C0%7C0%7C639120673615964835%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=%2BkJAEKtqo6XVwaunH4ycZm7mYzjNURYRbkMvYvnkvSM%3D&reserved=0
> > <
> >
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
> > >
> > PR (core):
> >
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fairflow%2Fpull%2F65450&data=05%7C02%7Cdavid.blain%40infrabel.be%7C08fa091c1ec64b36829b08de9cddcc96%7Cb82bc314ab8e4d6fb18946f02e1f27f2%7C0%7C0%7C639120673615992475%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=MRv8LGaBeRF6LXf54vG6U5HuoHKf4o%2FCCJgJw9JiUUE%3D&reserved=0
> > <https://github.com/apache/airflow/pull/65450>
> > PR (LLM-powered, common-ai provider):
> >
> >
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fairflow%2Fpull%2F65451&data=05%7C02%7Cdavid.blain%40infrabel.be%7C08fa091c1ec64b36829b08de9cddcc96%7Cb82bc314ab8e4d6fb18946f02e1f27f2%7C0%7C0%7C639120673616006007%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=jjaGhzjVDKgPVjIUDz8eb9m%2B8qpm9Pfk1HgYfO45%2B78%3D&reserved=0
> > <https://github.com/apache/airflow/pull/65451>
> >
> > The problem is straightforward: Airflow retries every failure the same
> way.
> > An expired API key gets retried 3 times over 15 minutes. A rate-limited
> API
> > gets retried immediately, hitting the same 429. Users who want smarter
> > retries today have to wrap every task in try/except and raise
> > AirflowFailException manually, mixing retry logic into business logic.
> >
> > This AIP adds a retry_policy parameter to BaseOperator. The policy
> > evaluates the actual exception at failure time and returns RETRY (with a
> > custom delay), FAIL (skip remaining retries), or DEFAULT (standard
> > behaviour). It runs in the worker process, not the scheduler.
> >
> > Declarative example:
> >
> > ```python
> >     @task(
> >         retries=5,
> >         retry_policy=ExceptionRetryPolicy(
> >         rules=[
> >             RetryRule(
> >             exception="requests.exceptions.HTTPError",
> >                     action=RetryAction.RETRY,
> >                     retry_delay=timedelta(minutes=5)
> >                 ),
> >             RetryRule(
> >             exception="google.auth.exceptions.RefreshError",
> >                   action=RetryAction.FAIL
> >               ),
> >         ]
> >     ),
> >     )
> >     def call_api():
> >         ...
> > ```
> >
> > LLM-powered example -- uses any pydantic-ai provider (OpenAI, Anthropic,
> > Bedrock, Ollama):
> >
> >     @task(retries=5, retry_policy=(llm_conn_id="my_llm"))
> >     def call_flaky_api(): ...
> >
> > The LLM version classifies errors into categories (auth, rate_limit,
> > network, data, transient, permanent) using structured output with a
> > 30-second timeout and declarative fallback rules for when the LLM itself
> is
> > down.
> >
> > I have attached demo videos and screenshots to both PRs showing both
> > policies running end-to-end in Airflow -- including the LLM correctly
> > classifying 4 different error types via Claude Haiku.
> >
> > Full design, done criteria, and implementation details are in the wiki
> page
> > above.
> >
> > Feedback welcome.
> >
> > Thanks,
> > Kaxil
> >
>

Reply via email to