Sounds like a big win, congrats!

On Wed 24 Jun 2026 at 08:43, Shahar Epstein <[email protected]> wrote:

> Well done Amogh!
> Hoping to see implementations in other providers inspired by this :)
>
>
> Shahar
>
> On Tue, Jun 23, 2026, 20:17 Amogh Desai <[email protected]> wrote:
>
> > Hi All,
> >
> > As we gear up towards Airflow 3.3, I am excited to announce that
> > durable/crash-safe operator execution is coming with this release.
> >
> > I just merged https://github.com/apache/airflow/pull/68623, which adds a
> > *durable* flag to *ResumableJobMixin* in Task SDK version 1.3.0, which
> > will ship with Airflow 3.3. I wanted to share what it does and hear from
> > the community.
> >
> > *What the mixin does*
> > I have showcased this a couple of times in dev calls / Airflow town halls,
> > but here's a short recap for a broader audience:
> >
> > Operators that submit a long-running job to an external system (Spark,
> > YARN, Databricks, etc.) and poll for completion have a classic retry
> > issue: if the worker crashes during polling, the retry submits a
> > duplicate job because Airflow is not *external system aware*.
> > The mixin fixes this by persisting the external job ID to
> > *task_state_store*, which is built on AIP-103, before polling. On retry,
> > it reads the ID back and reconnects to the running job instead of
> > resubmitting. It handles three cases:
> >
> > * still active: reconnect and keep polling
> > * already succeeded: return result immediately
> > * terminal failure: resubmit fresh
> >
> > As an initial case study, the SparkSubmitOperator has been ported to be
> > crash-safe across all its cluster modes (standalone, Spark on YARN, and
> > Spark on k8s), because solving Spark means solving most other data
> > engineering workloads. Related PRs:
> >
> > * Spark Standalone: https://github.com/apache/airflow/pull/67118
> > * Spark on YARN: https://github.com/apache/airflow/pull/67473
> > * Spark on k8s: https://github.com/apache/airflow/pull/68067
> >
> > *Looking for more patterns*
> > If you maintain or work with an operator that fits the criteria below, or
> > have hit the duplicate-job-on-retry problem, do share it here, or ping me
> > on Slack, and we can collaborate on migrating it to be crash-safe.
> >
> > Criteria for a good fit:
> > * Operator submits to an external system and polls
> > * The job has a stable tracking ID that survives the worker process
> > * The external system keeps the job alive independently
> >
> > Looking forward to hearing from the community and excited to see what
> > gets built on top of this.
> >
> > Thanks & Regards,
> > Amogh Desai
> >
>
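
For anyone who wants a concrete picture of the three retry cases in the quoted announcement (still active, already succeeded, terminal failure), here is a minimal sketch. It is not the actual ResumableJobMixin API: `FakeJobClient`, `execute`, the `poll` callback, and the plain dict standing in for the task state store are all hypothetical names made up for illustration.

```python
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class FakeJobClient:
    """Hypothetical stand-in for an external system (Spark, Databricks, ...)."""

    def __init__(self):
        self.jobs = {}      # job_id -> JobStatus
        self.submitted = 0  # counts submissions so duplicates are visible

    def submit(self):
        self.submitted += 1
        job_id = f"job-{self.submitted}"
        self.jobs[job_id] = JobStatus.RUNNING
        return job_id

    def status(self, job_id):
        return self.jobs[job_id]


def execute(client, state_store, key, poll):
    """Submit a job, or resume one recorded by an earlier attempt.

    `state_store` is a plain dict here (assumption: the real store survives
    worker crashes). `poll(client, job_id)` blocks until the job reaches a
    terminal state and returns that state.
    """
    job_id = state_store.get(key)
    if job_id is not None:
        status = client.status(job_id)
        if status is JobStatus.SUCCEEDED:
            return job_id          # already succeeded: return immediately
        if status is JobStatus.FAILED:
            job_id = None          # terminal failure: resubmit fresh
        # still active: fall through and keep polling the same job
    if job_id is None:
        job_id = client.submit()
        state_store[key] = job_id  # persist the ID before polling starts
    final = poll(client, job_id)
    if final is not JobStatus.SUCCEEDED:
        raise RuntimeError(f"{job_id} ended as {final.value}")
    return job_id


def crash(client, job_id):
    """Simulates the worker dying while it polls."""
    raise ConnectionError("worker died while polling")


def finish(client, job_id):
    """Simulates the external job completing successfully."""
    client.jobs[job_id] = JobStatus.SUCCEEDED
    return JobStatus.SUCCEEDED


client, store = FakeJobClient(), {}
try:
    execute(client, store, "task-1", crash)        # first attempt crashes
except ConnectionError:
    pass
result = execute(client, store, "task-1", finish)  # retry reconnects
```

In this sketch the retry picks up the job submitted by the crashed attempt, so the fake client sees only one submission. The important ordering is that the ID is written to the store before polling begins; otherwise a crash during polling would leave nothing to reconnect to.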
