Very nice and straightforward.

J.


On Sat, Apr 18, 2026 at 6:15 PM Jens Scheffler <[email protected]> wrote:

> Hi Kaxil,
>
> very cool proposal! Added just a few comments and would LOVE to see this
> in 3.3! This really opens the door (securely!) to extend the retry logic
> to much more use cases!
>
> Jens
>
> On 18.04.26 09:02, Kaxil Naik wrote:
> > Example in the email got cut, but check the docs or the demo video in
> > the 2nd PR; you can either pass custom instructions or it uses the
> > default instructions:
> >
> >      llm_policy = LLMRetryPolicy(
> >          llm_conn_id="pydanticai_default",
> >          instructions="...",
> >          timeout=30.0,  # max seconds to wait for LLM response
> >          fallback_rules=[  # used when LLM call fails
> >              RetryRule(exception=ConnectionError, action=RetryAction.RETRY,
> >                        retry_delay=timedelta(seconds=10)),
> >              RetryRule(exception=PermissionError, action=RetryAction.FAIL),
> >          ],
> >      )
> >
> >      @task(retries=5, retry_policy=llm_policy)
> >      def call_external_api():
> >          ...
> >
> >
> >
> > https://github.com/apache/airflow/commit/effb3ef00d29a476010d502d15dcebc1cd11cdb6
> >
> > https://github.com/apache/airflow/blob/effb3ef00d29a476010d502d15dcebc1cd11cdb6/providers/common/ai/src/airflow/providers/common/ai/policies/retry.py#L49-L62
> >
> >
> > On Sat, 18 Apr 2026 at 05:09, Dev-iL <[email protected]> wrote:
> >
> >> Sounds very useful!
> >>
> >> Regarding the LLM-powered case: where do the system prompt or custom
> >> user instructions go? The only thing we specified is the connection
> >> id, yet the connection doesn't have a system prompt field (at least
> >> according to
> >> https://airflow.apache.org/docs/apache-airflow-providers-common-ai/stable/connections/pydantic_ai.html
> >> ).
> >> So how do we configure the agent to classify into nonstandard
> >> categories, or behave according to our specifications when certain
> >> types of errors are encountered?
> >>
> >> Best,
> >> Dev-iL
> >>
> >> On Sat, 18 Apr 2026, 3:02 Kaxil Naik, <[email protected]> wrote:
> >>
> >>> Hi all,
> >>>
> >>> Continuing the push to make Airflow AI-native, I have put together
> >>> AIP-105: Pluggable Retry Policies.
> >>>
> >>> Wiki:
> >>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
> >>> PR (core): https://github.com/apache/airflow/pull/65450
> >>> PR (LLM-powered, common-ai provider):
> >>> https://github.com/apache/airflow/pull/65451
> >>>
> >>> The problem is straightforward: Airflow retries every failure the
> >>> same way. An expired API key gets retried 3 times over 15 minutes. A
> >>> rate-limited API gets retried immediately, hitting the same 429.
> >>> Users who want smarter retries today have to wrap every task in
> >>> try/except and raise AirflowFailException manually, mixing retry
> >>> logic into business logic.
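The try/except workaround described above can be made concrete with a short sketch. AirflowFailException is Airflow's real skip-the-retries exception, but it is stubbed here so the snippet runs stand-alone, and the task body is invented for illustration:

```python
# Status-quo workaround: classify exceptions inside the task and re-raise a
# non-retryable error yourself, mixing retry logic into business logic.

class AirflowFailException(Exception):
    """Stand-in for airflow.exceptions.AirflowFailException (skips retries)."""

def call_api():
    # Hypothetical task body: pretend the API key has expired.
    raise PermissionError("401: API key expired")

def call_api_with_manual_retry_logic():
    try:
        call_api()
    except PermissionError as exc:
        # Not recoverable: fail immediately instead of retrying 3 times.
        raise AirflowFailException(str(exc)) from exc
    except ConnectionError:
        # Recoverable: re-raise and let Airflow's normal retries handle it.
        raise
```

Every task that wants this behaviour has to repeat the same boilerplate, which is what the proposed retry_policy parameter removes.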
> >>>
> >>> This AIP adds a retry_policy parameter to BaseOperator. The policy
> >>> evaluates the actual exception at failure time and returns RETRY
> >>> (with a custom delay), FAIL (skip remaining retries), or DEFAULT
> >>> (standard behaviour). It runs in the worker process, not the
> >>> scheduler.
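As a rough mental model of what rule-based evaluation does at failure time (the real interface lives in the linked core PR; the names below only mirror the email and may not match the actual API):

```python
# Minimal sketch of rule-based policy evaluation, assuming "first matching
# rule wins" semantics; the actual Airflow interface may differ.
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional

class RetryAction(Enum):
    RETRY = "retry"      # retry with the rule's custom delay
    FAIL = "fail"        # skip remaining retries and fail now
    DEFAULT = "default"  # fall back to standard retry behaviour

@dataclass
class RetryRule:
    exception: type
    action: RetryAction
    retry_delay: Optional[timedelta] = None

def evaluate(rules, exc):
    # Runs in the worker process against the actual exception instance.
    for rule in rules:
        if isinstance(exc, rule.exception):
            return rule.action, rule.retry_delay
    return RetryAction.DEFAULT, None

rules = [
    RetryRule(ConnectionError, RetryAction.RETRY, timedelta(minutes=5)),
    RetryRule(PermissionError, RetryAction.FAIL),
]
```

Anything not covered by a rule falls through to DEFAULT, so existing retries/retry_delay behaviour is preserved for unlisted exceptions.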
> >>>
> >>> Declarative example:
> >>>
> >>> ```python
> >>> @task(
> >>>     retries=5,
> >>>     retry_policy=ExceptionRetryPolicy(
> >>>         rules=[
> >>>             RetryRule(
> >>>                 exception="requests.exceptions.HTTPError",
> >>>                 action=RetryAction.RETRY,
> >>>                 retry_delay=timedelta(minutes=5),
> >>>             ),
> >>>             RetryRule(
> >>>                 exception="google.auth.exceptions.RefreshError",
> >>>                 action=RetryAction.FAIL,
> >>>             ),
> >>>         ]
> >>>     ),
> >>> )
> >>> def call_api():
> >>>     ...
> >>> ```
> >>>
> >>> LLM-powered example -- uses any pydantic-ai provider (OpenAI,
> >>> Anthropic, Bedrock, Ollama):
> >>>
> >>>     @task(retries=5, retry_policy=LLMRetryPolicy(llm_conn_id="my_llm"))
> >>>     def call_flaky_api(): ...
> >>>
> >>> The LLM version classifies errors into categories (auth, rate_limit,
> >>> network, data, transient, permanent) using structured output with a
> >>> 30-second timeout and declarative fallback rules for when the LLM
> >>> itself is down.
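The fallback behaviour described above can be sketched as follows. Every name here is illustrative (the real implementation is in the linked retry.py), the category-to-action mapping is a guess, and the sketch assumes the LLM classifier raises on timeout or outage:

```python
# Sketch of the fallback path: ask the LLM to classify the error; if that
# call itself fails (timeout, provider outage), fall back to the
# declarative rules. All names are illustrative, not the provider's API.

def resolve_with_fallback(classify_with_llm, fallback_rules, exc):
    try:
        category = classify_with_llm(exc)  # e.g. "auth", "rate_limit", ...
    except Exception:
        # LLM unavailable: use the declarative RetryRule path instead.
        return fallback_rules(exc)
    # Hypothetical mapping from category to action; the email lists the
    # categories but not the mapping, so this part is invented.
    return "FAIL" if category in {"auth", "permanent", "data"} else "RETRY"

def llm_is_down(exc):
    raise TimeoutError("LLM did not answer within 30s")
```

The point is that the task never blocks on a dead LLM: the 30-second timeout bounds the wait and the declarative rules give a deterministic answer when classification is unavailable.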
> >>>
> >>> I have attached demo videos and screenshots to both PRs showing both
> >>> policies running end-to-end in Airflow -- including the LLM
> >>> correctly classifying 4 different error types via Claude Haiku.
> >>>
> >>> Full design, done criteria, and implementation details are in the
> >>> wiki page above.
> >>>
> >>> Feedback welcome.
> >>>
> >>> Thanks,
> >>> Kaxil
> >>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
