Re: [DISCUSS] Feedback on DAG-level full-run retries (issue 60866)

Yuseok Jo Sun, 26 Apr 2026 08:15:17 -0700

I strongly agree with the principle that tasks should ideally be designed
to be idempotent at the task level. The alternatives you suggested look
genuinely useful for considering this issue.


   - Setup/Teardown fits well when the main concern is bracketing a
   pipeline with preparation/finalization, though it doesn't directly address
   failures in intermediate tasks.
   - The on_failure_callback approach seems like something that can serve
   other Airflow users with the same need through documentation alone, without
   any code changes.
   - QualityCheckOperator aligns better with data-quality validation than
   with arbitrary task-failure recovery, though the underlying "clear via API"
   building block it relies on is shared with the callback approach.
   - *TransactionTaskGroup* is an intriguing idea. As I understand it, it
   would be a TaskGroup with roughly the following behavior:
      - If any task within the group ultimately fails, the entire group
      becomes the target for clearing & retrying (following the DAG-level retry
      policy)
      - Tasks outside the group are unaffected → partial application is
      possible
      - Extending the existing TaskGroup feels like a natural shape
      - And simply placing all tasks of a DAG into a single such group
      would produce the same effect as the original request.
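
As a rough illustration of that behavior, the clearing decision could be
sketched as a pure function. This is not an existing Airflow API:
`tasks_to_clear` and all of its parameters are hypothetical names, and the
state string "failed" is only assumed to mirror a task's final state.

```python
from typing import Dict, Set


def tasks_to_clear(
    group_task_ids: Set[str],
    final_states: Dict[str, str],
    attempts_used: int,
    max_attempts: int,
) -> Set[str]:
    """Return the task ids to clear and retry for one transactional group.

    Hypothetical helper: nothing here is part of Airflow itself.
    """
    # Retry budget exhausted: leave the failure in place.
    if attempts_used >= max_attempts:
        return set()
    # Any ultimately failed member makes the whole group the clearing target.
    if any(final_states.get(task_id) == "failed" for task_id in group_task_ids):
        return set(group_task_ids)
    # No failure inside the group: nothing to clear.
    return set()
```

Tasks outside `group_task_ids` never appear in the result, which is what
makes partial application possible; passing every task of a DAG as the group
reproduces the full-run retry from the original request.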

That said, to be transparent: I was not a strong stakeholder in this issue
myself. The original reporter went silent and I escalated this to the
devlist on their behalf, so I was not in a great position to advocate for
the use case's urgency. Apologies also for the slow reply.

Given that, here is a reasonable direction:

   - Short term / immediate value: documenting the on_failure_callback +
   clear-API pattern as a how-to or example would help other users with the
   same need right away. Happy to put up a small PR for this.
   - Longer term: *TransactionTaskGroup* feels like it has value beyond
   this specific issue. I'd be glad to contribute.
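
To make the short-term item concrete, a minimal sketch of the callback +
clear-API pattern might look like the following. It is untested against a
live deployment and rests on several assumptions: the endpoint path is
modelled on the REST API's "clear DAG run" operation and its prefix and
payload need to be checked against the API reference for the Airflow version
in use, the callback context is assumed to expose a "dag_run" object with
`dag_id` and `run_id`, bearer-token auth is assumed, and the helper names
(`build_clear_request`, `make_clear_run_callback`) are made up for
illustration.

```python
import json
import urllib.parse
import urllib.request


def build_clear_request(base_url, dag_id, run_id, token):
    """Build (but do not send) a POST asking Airflow to clear one DAG run."""
    # ASSUMPTION: path and payload follow the "clear DAG run" REST operation;
    # verify the /api/v2 prefix and body fields for your Airflow version.
    path = "/api/v2/dags/{}/dagRuns/{}/clear".format(
        urllib.parse.quote(dag_id, safe=""),
        # Run ids usually contain ':' and '+', so they must be URL-encoded.
        urllib.parse.quote(run_id, safe=""),
    )
    payload = {"dry_run": False}
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,  # ASSUMPTION: token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


def make_clear_run_callback(base_url, token, send=urllib.request.urlopen):
    """Return a function usable as on_failure_callback.

    `send` is injectable so the callback can be exercised without a server.
    """
    def _callback(context):
        # ASSUMPTION: the callback context carries the current DAG run.
        dag_run = context["dag_run"]
        send(build_clear_request(base_url, dag_run.dag_id, dag_run.run_id, token))

    return _callback
```

Note that this sketch has no cap on attempts. A real how-to would need one,
since a deterministic failure would otherwise clear and re-run the same DAG
run indefinitely.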

Thanks again for the detailed and thoughtful response.
It really helped clarify things.

On Mon, Apr 20, 2026 at 5:25 AM Jens Scheffler <[email protected]> wrote:

> Hi,
>
> as nobody else was answering on the DISCUSS let me try to break the ice.
> I was commenting on the PR already.
>
> I am not a big fan of adding more parameters for retries, as I assume a
> lot of options already exist, mainly at task level.
>
> My proposal in general would be to model a pipeline so that all tasks
> are idempotent and the full pipeline does not need to be retried. This
> is a matter of cost as well as of time. If you need to re-run the full
> chain, then this either smells like the pipeline is badly modelled
> (e.g. tasks are not idempotent) or it is actually a re-run with changed
> parameters (maybe it was started wrongly). A technical need to re-run
> all ... might also be a backfill case? So I am not seeing a strong case
> that would have been missed as a feature in the last 10 years.
>
> If there actually is one (and please convince me with the right
> arguments), then I would still ask you to consider the following
> options:
>
>   * Does the workflow mainly require doing something beforehand as
>     preparation and maybe something as finalization? Then the
>     "Setup/Teardown" tasks might be a good fit. Especially if
>     the pipeline is only 3 tasks, you can use this to ensure
>     everything re-runs.
>   * You could also attempt to fix this without changes in the scheduler
>     via an on_failure_callback (see
>
> https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/callbacks.html#callback-types
> )
>     and hook a function that clears all tasks via API - and attach this
>     callback as default to all tasks or to the Dag at the end.
>   * Instead of extending the Dag and Scheduler logic I would imagine
>     there might be an option to implement a "QualityCheckOperator" that
>     takes a condition and in case of not meeting quality criteria then
>     makes a "Clear DagRun" via API. This would not require additional
>     Dag parameters and would not need any extensions on the scheduler
>     but via API could be called from an Operator as alternative.
>   * I could also imagine that the request raised was naming a Dag, but
>     a moment later somebody will have the same need for a set of tasks
>     only. So another alternative could be a "TransactionTaskGroup",
>     which would treat all tasks in that task group as a combined
>     transaction. If one is cleared or one needs a retry, all are
>     retried together. You could then apply this to a subset of tasks
>     or, if all tasks are in that group, to the full Dag.
>
> So if the reporter is silent now, then we might need to get the original
> voice and see if one of the options is already a solution to the
> problem. Happy to be convinced.
>
> Jens
>
> On 08.04.26 22:12, Przemysław Mirowski wrote:
> > Hello,
> >
> > I checked the discussion and I don't really see any real use case
> > where that could be potentially needed. The tasks currently can send
> > some data between their executions via xcom or some other methods
> > implemented in task logic, but these data should rather not change if
> > the input didn't change (e.g. from upstream tasks), so the retrying on
> > task level should be sufficient.
> >
> >> One user-side story I can picture is ML-style pipelines where a final
> >> validation or evaluation step fails and teams want a full rerun of the
> >> run instead of only retrying failed tasks.
> > Failure within the ML pipeline, IMHO, would only require a retry at
> > task level, as e.g. the models, after training, should be saved and
> > used by other tasks. A potential issue I would see (within ML
> > pipelines) is when the task itself fails and retrying the whole
> > operation is expensive, but that part could be solved after AIP-103.
> >
> > Maybe the only need for retrying everything (without thinking
> > Airflow-specific) would be e.g. some time-series or streaming-related
> > cases where, after a failure somewhere, the whole processing becomes
> > invalid (basically operations where no process design would allow
> > retrying only part of it).
> >
> >> Do you feel this need in practice?/do you see it as something that
> >> belongs in core?
> > Not really, at least for now.
> >
> >> How do you work around it today?
> > Designing the processes in a way where only task-level retries are
> > needed if a failure occurs.
> >
> > Regards,
> > PM
> >
> > ________________________________
> > From: Yuseok Jo<[email protected]>
> > Sent: 07 April 2026 15:07
> > To:[email protected] <[email protected]>
> > Subject: [DISCUSS] Feedback on DAG-level full-run retries (issue 60866)
> >
> > Hello community,
> >
> > I would like to pick up discussion on GitHub issue 60866 about DAG-level
> > automatic retries or rerunning a whole DAG run from the start when a
> > terminal task fails or the DAG run ends in a certain state.
> > https://github.com/apache/airflow/issues/60866
> >
> > I am not the person who originally opened that issue, and the original
> > author may not be active now. I am unsure whether this is a real gap for
> > users or something we should handle with patterns we already have.
> >
> > One user-side story I can picture is ML-style pipelines where a final
> > validation or evaluation step fails and teams want a full rerun of the
> > run instead of only retrying failed tasks. This is just one possible
> > scenario. Other domains may have similar needs.
> >
> > I am not proposing a core change yet. I mainly want light feedback on
> > three points.
> > Do you feel this need in practice?
> > How do you work around it today?
> > And do you see it as something that belongs in core?
> >
> > Thanks,
> > Yuseok Jo
> >
