A bit late to the party, but this is very well drafted: it identifies a real problem, and the AIP is minimal and to the point.

Thanks & Regards,
Amogh Desai


On Thu, Apr 23, 2026 at 1:57 AM Kaxil Naik <[email protected]> wrote:

> Thanks all, given the general consensus around the use-case and high-level
> implementation, I am going to start the vote.
>
> On Wed, 22 Apr 2026 at 18:36, Blain David <[email protected]> wrote:
>
> > I really like this direction. It’s something I’ve been thinking about as
> > well, although from a slightly different angle.
> > I’ve been considering starting a discussion around making retry behavior
> > more dynamic based on runtime context, rather than introducing AI
> > specifically. The current model is quite static: we retry blindly based
> > on configuration, without considering why the failure happened or what
> > the system state looks like at that moment.
> > What I find compelling in this AIP is the shift toward failure-aware
> > retries. That aligns closely with the idea of making DAGs more resilient:
> > not just retrying in the hope of eventual success, but making a more
> > informed decision based on the nature of the failure.
> > One thing I’d be interested in exploring further is how far we can push
> > this in a deterministic/runtime-driven way (e.g. exception type, response
> > metadata, external signals like rate limits or downstream system health),
> > and how that compares to or complements the LLM-based approach.
> > Overall, this feels like a strong step toward decoupling retry logic from
> > business logic, which is definitely a gap today.
> > Very nice proposal Kaxil, so definitely +1 for me.
> >
> > ________________________________
> > From: Kaxil Naik <[email protected]>
> > Sent: Saturday, April 18, 2026 02:01
> > To: [email protected] <[email protected]>
> > Subject: [DISCUSS] AIP-105: Pluggable Retry Policies
> >
> > Hi all,
> >
> > Continuing the push to make Airflow AI-native, I have put together
> > AIP-105: Pluggable Retry Policies.
> >
> > Wiki:
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
> > PR (core):
> > https://github.com/apache/airflow/pull/65450
> > PR (LLM-powered, common-ai provider):
> > https://github.com/apache/airflow/pull/65451
> >
> > The problem is straightforward: Airflow retries every failure the same
> > way. An expired API key gets retried 3 times over 15 minutes. A
> > rate-limited API gets retried immediately, hitting the same 429. Users
> > who want smarter retries today have to wrap every task in try/except and
> > raise AirflowFailException manually, mixing retry logic into business
> > logic.
> >
> > This AIP adds a retry_policy parameter to BaseOperator. The policy
> > evaluates the actual exception at failure time and returns RETRY (with a
> > custom delay), FAIL (skip remaining retries), or DEFAULT (standard
> > behaviour). It runs in the worker process, not the scheduler.
> >
> > Declarative example:
> >
> > ```python
> > @task(
> >     retries=5,
> >     retry_policy=ExceptionRetryPolicy(
> >         rules=[
> >             RetryRule(
> >                 exception="requests.exceptions.HTTPError",
> >                 action=RetryAction.RETRY,
> >                 retry_delay=timedelta(minutes=5),
> >             ),
> >             RetryRule(
> >                 exception="google.auth.exceptions.RefreshError",
> >                 action=RetryAction.FAIL,
> >             ),
> >         ]
> >     ),
> > )
> > def call_api():
> >     ...
> > ```
> >
> > LLM-powered example -- uses any pydantic-ai provider (OpenAI, Anthropic,
> > Bedrock, Ollama):
> >
> > @task(retries=5, retry_policy=(llm_conn_id="my_llm"))
> > def call_flaky_api(): ...
> >
> > The LLM version classifies errors into categories (auth, rate_limit,
> > network, data, transient, permanent) using structured output with a
> > 30-second timeout and declarative fallback rules for when the LLM itself
> > is down.
> >
> > I have attached demo videos and screenshots to both PRs showing both
> > policies running end-to-end in Airflow -- including the LLM correctly
> > classifying 4 different error types via Claude Haiku.
> >
> > Full design, done criteria, and implementation details are in the wiki
> > page above.
> >
> > Feedback welcome.
> >
> > Thanks,
> > Kaxil
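
For readers skimming the thread, here is a rough, standalone sketch of how a declarative policy like the one quoted above could map an exception to RETRY, FAIL, or DEFAULT. It does not import Airflow and is not the code from the PRs: the class names mirror the quoted example, but everything here is a local stand-in, and the `evaluate` method, the `RetryDecision` type, the first-match-wins ordering, and matching on base classes are all assumptions made for illustration.

```python
from __future__ import annotations

import enum
from dataclasses import dataclass
from datetime import timedelta


class RetryAction(enum.Enum):
    RETRY = "retry"      # retry, optionally with a custom delay
    FAIL = "fail"        # skip the remaining retries
    DEFAULT = "default"  # fall back to the standard retry behaviour


@dataclass(frozen=True)
class RetryDecision:
    # Hypothetical return type; the AIP may shape this differently.
    action: RetryAction
    retry_delay: timedelta | None = None


@dataclass(frozen=True)
class RetryRule:
    exception: str  # dotted path, e.g. "builtins.TimeoutError"
    action: RetryAction
    retry_delay: timedelta | None = None


def _dotted_names(exc: BaseException) -> set[str]:
    """Dotted paths of the exception's class and all of its base classes."""
    return {f"{cls.__module__}.{cls.__qualname__}" for cls in type(exc).__mro__}


class ExceptionRetryPolicy:
    def __init__(self, rules: list[RetryRule]) -> None:
        self.rules = rules

    def evaluate(self, exc: BaseException) -> RetryDecision:
        names = _dotted_names(exc)
        for rule in self.rules:  # assumption: the first matching rule wins
            if rule.exception in names:
                return RetryDecision(rule.action, rule.retry_delay)
        # No rule matched: leave the decision to the standard behaviour.
        return RetryDecision(RetryAction.DEFAULT)


# Built-in exceptions are used so the sketch runs without third-party packages.
policy = ExceptionRetryPolicy(
    rules=[
        RetryRule("builtins.TimeoutError", RetryAction.RETRY, timedelta(minutes=5)),
        RetryRule("builtins.PermissionError", RetryAction.FAIL),
    ]
)

for error in (TimeoutError("slow"), PermissionError("denied"), ValueError("bad row")):
    decision = policy.evaluate(error)
    print(type(error).__name__, decision.action.name, decision.retry_delay)
```

Matching on dotted-path strings rather than imported classes means a rule can name an exception from a library that is not installed where the DAG is parsed, which may be one reason the quoted example uses strings. A deterministic policy of this shape is also the kind of thing David's reply describes, and it could sit in front of, or serve as a fallback for, an LLM-based classifier.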
