Makes a lot of sense to me!

On 2026/04/19 13:56:56 Elad Kalif wrote:
> Great idea!
> Love it!
> 
> I have some questions / comments:
> 1. The current interface suggests rules that contain a RetryRule object,
> but I wonder if we should change exception to exceptions and accept a
> list.
> 
>         rules=[
>             RetryRule(
>                 exceptions=[
>                     "requests.exceptions.HTTPError",
>                     "google.auth.exceptions.RefreshError",
>                 ],
>                 ...,
>             )
>         ]
> 
> I'm thinking about a case where several exceptions need the same behaviour
> and the user may not wish to offer different reasoning for each.
> 
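To make point 1 concrete, here is a minimal sketch of how a rule holding several dotted exception paths could match. MultiExceptionRule, Action and matches() are hypothetical stand-ins for illustration, not the API in the PR, and builtin exceptions are used so it runs without extra packages:

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional, Sequence


class Action(Enum):  # hypothetical stand-in for the AIP's RetryAction
    RETRY = "retry"
    FAIL = "fail"
    DEFAULT = "default"


@dataclass(frozen=True)
class MultiExceptionRule:  # hypothetical stand-in for RetryRule
    exceptions: Sequence[str]  # dotted paths, as in the proposal
    action: Action
    retry_delay: Optional[timedelta] = None

    def matches(self, exc: BaseException) -> bool:
        # Compare against the dotted path of every class in the MRO so
        # that subclasses of a listed exception also match.
        paths = {f"{c.__module__}.{c.__qualname__}" for c in type(exc).__mro__}
        return any(name in paths for name in self.exceptions)


# One rule, shared behaviour for two exception families.
rule = MultiExceptionRule(
    exceptions=["builtins.TimeoutError", "builtins.ConnectionError"],
    action=Action.RETRY,
    retry_delay=timedelta(minutes=5),
)
```

Matching on strings rather than classes means the Dag file does not need to import the libraries that raise the exceptions, which I assume is why the proposal uses dotted paths.
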
> 2. Does it make sense to extend the interface to xcom values? I'm thinking
> about a case where dag authors don't have full control over the exception
> raised, or where an upstream library changes the exception and thereby
> breaks the retry logic. Maybe we should also offer the option to set
> retry behaviour based on the previous attempt's xcom value?
> 
> 3. Maybe something for the longer run but still worth discussing - one of
> the main motivations for custom weight rules
> https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/priority-weight.html#custom-weight-rule
> was to set priority based on try number. I wonder if we may want to somehow
> combine it with the Retry rule. For retries, I can argue that the weight of
> the task is a property of the retry instructions, and it may very well be
> that the weight will change depending on the exception.
> 
> On Sun, Apr 19, 2026 at 6:30 AM Shahar Epstein <[email protected]> wrote:
> 
> > Great idea! I liked both the deterministic approach and the
> > AI-integrated one.
> >
> >
> > Shahar
> >
> > On Sat, Apr 18, 2026 at 3:02 AM Kaxil Naik <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > Continuing the push to make Airflow AI-native, I have put together
> > > AIP-105: Pluggable Retry Policies.
> > >
> > > Wiki:
> > > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
> > > PR (core): https://github.com/apache/airflow/pull/65450
> > > PR (LLM-powered, common-ai provider):
> > > https://github.com/apache/airflow/pull/65451
> > >
> > > The problem is straightforward: Airflow retries every failure the same
> > > way. An expired API key gets retried 3 times over 15 minutes. A
> > > rate-limited API gets retried immediately, hitting the same 429. Users
> > > who want smarter retries today have to wrap every task in try/except and
> > > raise AirflowFailException manually, mixing retry logic into business
> > > logic.
> > >
> > > This AIP adds a retry_policy parameter to BaseOperator. The policy
> > > evaluates the actual exception at failure time and returns RETRY (with a
> > > custom delay), FAIL (skip remaining retries), or DEFAULT (standard
> > > behaviour). It runs in the worker process, not the scheduler.
> > >
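For readers skimming the thread, the three outcomes described above can be sketched as a first-match-wins lookup. This is only an illustration under my own assumptions: Action, Decision and evaluate() are hypothetical names, and builtin exception classes stand in for the real ones:

```python
from datetime import timedelta
from enum import Enum
from typing import NamedTuple, Optional


class Action(Enum):  # hypothetical stand-in for the AIP's RetryAction
    RETRY = "retry"      # retry, optionally with a custom delay
    FAIL = "fail"        # skip remaining retries
    DEFAULT = "default"  # fall back to the standard retry behaviour


class Decision(NamedTuple):
    action: Action
    retry_delay: Optional[timedelta] = None


def evaluate(rules, exc):
    """Return the decision of the first rule whose exception type matches."""
    for exc_type, decision in rules:
        if isinstance(exc, exc_type):
            return decision
    # No rule matched: leave the existing retry behaviour untouched.
    return Decision(Action.DEFAULT)


rules = [
    (TimeoutError, Decision(Action.RETRY, timedelta(minutes=5))),
    (PermissionError, Decision(Action.FAIL)),
]
```

Returning DEFAULT when nothing matches keeps tasks without a matching rule on their current retries and retry_delay settings.
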
> > > Declarative example:
> > >
> > > ```python
> > >     @task(
> > >         retries=5,
> > >         retry_policy=ExceptionRetryPolicy(
> > >             rules=[
> > >                 RetryRule(
> > >                     exception="requests.exceptions.HTTPError",
> > >                     action=RetryAction.RETRY,
> > >                     retry_delay=timedelta(minutes=5),
> > >                 ),
> > >                 RetryRule(
> > >                     exception="google.auth.exceptions.RefreshError",
> > >                     action=RetryAction.FAIL,
> > >                 ),
> > >             ]
> > >         ),
> > >     )
> > >     def call_api():
> > >         ...
> > > ```
> > >
> > > LLM-powered example -- uses any pydantic-ai provider (OpenAI, Anthropic,
> > > Bedrock, Ollama):
> > >
> > >     @task(retries=5, retry_policy=(llm_conn_id="my_llm"))
> > >     def call_flaky_api(): ...
> > >
> > > The LLM version classifies errors into categories (auth, rate_limit,
> > > network, data, transient, permanent) using structured output with a
> > > 30-second timeout and declarative fallback rules for when the LLM itself
> > > is down.
> > >
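The timeout-plus-fallback behaviour described above could look roughly like the sketch below. It does not reflect the code in the PR: classify_with_fallback, slow_classifier and rule_fallback are hypothetical names, and a sleeping function simulates an unresponsive LLM:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def classify_with_fallback(classifier, fallback, exc, timeout=30.0):
    """Run classifier(exc) with a time limit; use fallback(exc) if it fails."""
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        return executor.submit(classifier, exc).result(timeout=timeout)
    except Exception:
        # Covers both the timeout and any error raised by the classifier.
        return fallback(exc)
    finally:
        # Do not block on a classifier call that is still running.
        executor.shutdown(wait=False)


def slow_classifier(exc):
    time.sleep(0.5)  # simulates an LLM that does not answer in time
    return "rate_limit"


def rule_fallback(exc):
    # Stand-in for declarative fallback rules.
    return "auth" if isinstance(exc, PermissionError) else "transient"


category = classify_with_fallback(
    slow_classifier, rule_fallback, PermissionError(), timeout=0.05
)
```

Here the slow call times out, so the category comes from the fallback rules instead of the classifier.
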
> > > I have attached demo videos and screenshots to both PRs showing both
> > > policies running end-to-end in Airflow -- including the LLM correctly
> > > classifying 4 different error types via Claude Haiku.
> > >
> > > Full design, done criteria, and implementation details are in the wiki
> > > page above.
> > >
> > > Feedback welcome.
> > >
> > > Thanks,
> > > Kaxil
> > >
> >
> 
