Though, of course, I need to take a deeper look - but that sounds like a "big" "small" improvement :)
On Sat, Apr 18, 2026 at 8:29 PM Jarek Potiuk <[email protected]> wrote:

> Very nice and straightforward.
>
> J.
>
>
> On Sat, Apr 18, 2026 at 6:15 PM Jens Scheffler <[email protected]> wrote:
>
>> Hi Kaxil,
>>
>> very cool proposal! Added just a few comments and would LOVE to see this
>> in 3.3! This really opens the door (securely!) to extending the retry logic
>> to many more use cases!
>>
>> Jens
>>
>> On 18.04.26 09:02, Kaxil Naik wrote:
>> > The example in the email got cut, but check the docs or the demo video in
>> > the 2nd PR; you can either pass custom instructions or it uses the default
>> > instructions:
>> >
>> > llm_policy = LLMRetryPolicy(
>> >     llm_conn_id="pydanticai_default",
>> >     instructions="...",
>> >     timeout=30.0,  # max seconds to wait for LLM response
>> >     fallback_rules=[  # used when LLM call fails
>> >         RetryRule(exception=ConnectionError, action=RetryAction.RETRY,
>> >                   retry_delay=timedelta(seconds=10)),
>> >         RetryRule(exception=PermissionError, action=RetryAction.FAIL),
>> >     ],
>> > )
>> >
>> > @task(retries=5, retry_policy=llm_policy)
>> > def call_external_api():
>> >     ...
>> >
>> > https://github.com/apache/airflow/commit/effb3ef00d29a476010d502d15dcebc1cd11cdb6
>> > https://github.com/apache/airflow/blob/effb3ef00d29a476010d502d15dcebc1cd11cdb6/providers/common/ai/src/airflow/providers/common/ai/policies/retry.py#L49-L62
>> >
>> >
>> > On Sat, 18 Apr 2026 at 05:09, Dev-iL <[email protected]> wrote:
>> >
>> >> Sounds very useful!
>> >>
>> >> Regarding the LLM-powered case: where do the system prompt or custom user
>> >> instructions go? The only thing we specified is the connection id, yet the
>> >> connection doesn't have a system prompt field (at least according to
>> >> https://airflow.apache.org/docs/apache-airflow-providers-common-ai/stable/connections/pydantic_ai.html).
>> >> So how do we configure the agent to classify into nonstandard categories,
>> >> or to behave according to our specifications when certain types of errors
>> >> are encountered?
>> >>
>> >> Best,
>> >> Dev-iL
>> >>
>> >> On Sat, 18 Apr 2026, 3:02 Kaxil Naik, <[email protected]> wrote:
>> >>
>> >>> Hi all,
>> >>>
>> >>> Continuing the push to make Airflow AI-native, I have put together
>> >>> AIP-105: Pluggable Retry Policies.
>> >>>
>> >>> Wiki:
>> >>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-105%3A+Pluggable+Retry+Policies
>> >>> PR (core): https://github.com/apache/airflow/pull/65450
>> >>> PR (LLM-powered, common-ai provider):
>> >>> https://github.com/apache/airflow/pull/65451
>> >>>
>> >>> The problem is straightforward: Airflow retries every failure the same
>> >>> way. An expired API key gets retried 3 times over 15 minutes. A
>> >>> rate-limited API gets retried immediately, hitting the same 429. Users
>> >>> who want smarter retries today have to wrap every task in try/except and
>> >>> raise AirflowFailException manually, mixing retry logic into business
>> >>> logic.
>> >>>
>> >>> This AIP adds a retry_policy parameter to BaseOperator. The policy
>> >>> evaluates the actual exception at failure time and returns RETRY (with a
>> >>> custom delay), FAIL (skip remaining retries), or DEFAULT (standard
>> >>> behaviour). It runs in the worker process, not the scheduler.
>> >>>
>> >>> Declarative example:
>> >>>
>> >>> ```python
>> >>> @task(
>> >>>     retries=5,
>> >>>     retry_policy=ExceptionRetryPolicy(
>> >>>         rules=[
>> >>>             RetryRule(
>> >>>                 exception="requests.exceptions.HTTPError",
>> >>>                 action=RetryAction.RETRY,
>> >>>                 retry_delay=timedelta(minutes=5)
>> >>>             ),
>> >>>             RetryRule(
>> >>>                 exception="google.auth.exceptions.RefreshError",
>> >>>                 action=RetryAction.FAIL
>> >>>             ),
>> >>>         ]
>> >>>     ),
>> >>> )
>> >>> def call_api():
>> >>>     ...
>> >>> ```
>> >>>
>> >>> LLM-powered example -- uses any pydantic-ai provider (OpenAI, Anthropic,
>> >>> Bedrock, Ollama):
>> >>>
>> >>> @task(retries=5, retry_policy=LLMRetryPolicy(llm_conn_id="my_llm"))
>> >>> def call_flaky_api():
>> >>>     ...
>> >>>
>> >>> The LLM version classifies errors into categories (auth, rate_limit,
>> >>> network, data, transient, permanent) using structured output, with a
>> >>> 30-second timeout and declarative fallback rules for when the LLM itself
>> >>> is down.
>> >>>
>> >>> I have attached demo videos and screenshots to both PRs showing both
>> >>> policies running end-to-end in Airflow -- including the LLM correctly
>> >>> classifying 4 different error types via Claude Haiku.
>> >>>
>> >>> Full design, done criteria, and implementation details are in the wiki
>> >>> page above.
>> >>>
>> >>> Feedback welcome.
>> >>>
>> >>> Thanks,
>> >>> Kaxil
>> >>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
