Re: [ANNOUNCE] Durable/crash safe operator execution from Airflow 3.3

Stefan Wang Wed, 24 Jun 2026 11:38:32 -0700

Congrats! And thanks a lot for driving and landing this, Amogh and the AIP-103 team!


We have almost exactly the same implementation internally, and we are contributing 
back in the areas where we have already added the Resumable features:

Starting with TriggerDagRunOperator: 
https://github.com/apache/airflow/pull/68936

Best,
Stefan

> On Jun 24, 2026, at 1:13 AM, Pierre Jeambrun <[email protected]> wrote:
> 
> Sounds like a big win, congrats!
> 
> On Wed 24 Jun 2026 at 08:43, Shahar Epstein <[email protected]> wrote:
> 
>> Well done Amogh!
>> Hoping to see implementations in other providers inspired by this :)
>> 
>> 
>> Shahar
>> 
>> On Tue, Jun 23, 2026, 20:17 Amogh Desai <[email protected]> wrote:
>> 
>>> Hi All,
>>> 
>>> As we gear up towards Airflow 3.3, I am excited to announce that
>>> durable/crash-safe operator execution is
>>> coming in with Airflow 3.3.
>>> 
>>> I just merged https://github.com/apache/airflow/pull/68623 which adds a
>>> *durable* flag to *ResumableJobMixin* in
>>> the Task SDK version 1.3.0, which will be released with Airflow 3.3. I wanted
>>> to share what it does and hear from the
>>> community.
>>> 
>>> *What the mixin does*
>>> I have showcased this a couple of times in dev calls / Airflow town halls, but
>>> here's a short recap for a broader audience:
>>> 
>>> Operators that submit a long-running job to an external system (Spark,
>>> YARN, Databricks, etc.) and poll for completion
>>> have a classic retry issue: the worker crashes during polling, and the retry
>>> submits a duplicate job because Airflow is not *external system aware*.
>>> The mixin fixes this by persisting the external job ID to *task_state_store*,
>>> which is built on AIP-103, before polling. On retry,
>>> it reads the ID back and reconnects to the running job instead of
>>> resubmitting. It handles three cases:
>>> 
>>> * still active: reconnect and keep polling
>>> * already succeeded: return result immediately
>>> * terminal failure: resubmit fresh
>>> 
>>> The SparkSubmitOperator has been ported over to be crash-safe across all
>>> its cluster modes (standalone, Spark on YARN, Spark on k8s)
>>> as an initial case study, because solving Spark means solving most other
>>> data engineering workloads. Related PRs:
>>> 
>>> * Spark Standalone: https://github.com/apache/airflow/pull/67118
>>> * Spark on YARN: https://github.com/apache/airflow/pull/67473
>>> * Spark on k8s: https://github.com/apache/airflow/pull/68067
>>> 
>>> *Looking for more patterns*
>>> If you maintain/work with an operator that fits the criteria below, or have
>>> hit the duplicate-job-on-retry problem, do share it here, or you can
>>> even ping me on Slack and we can collaborate on migrating it to be
>>> crash-safe.
>>> 
>>> Criteria for a good fit:
>>> * Operator submits to an external system and polls
>>> * The job has a stable tracking ID that survives the worker process
>>> * The external system keeps the job alive independently
>>> 
>>> Looking forward to hearing from the community and excited to see what gets
>>> built on top of this.
>>> 
>>> Thanks & Regards,
>>> Amogh Desai
>>> 
>> 
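
For readers following the thread, the three retry cases from the announcement (still active, already succeeded, terminal failure) can be sketched roughly as below. This is only an illustration of the pattern, not the actual ResumableJobMixin code: the plain dict standing in for task_state_store, the FakeExternalSystem class, the JobStatus enum and the execute() function are all hypothetical names made up for this sketch, and the polling loop itself is left out.

```python
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class FakeExternalSystem:
    """Hypothetical stand-in for Spark/YARN/Databricks; not a real client."""

    def __init__(self):
        self.jobs = {}  # job_id -> JobStatus
        self.submit_count = 0

    def submit(self):
        # The job lives in the external system, independent of the worker.
        self.submit_count += 1
        job_id = f"job-{self.submit_count}"
        self.jobs[job_id] = JobStatus.RUNNING
        return job_id

    def status(self, job_id):
        return self.jobs[job_id]


def execute(state_store, task_key, system):
    """Return (job_id, action) for one task attempt.

    state_store is a plain dict used here as an assumed stand-in for
    task_state_store; the real store and its API may look different.
    """
    job_id = state_store.get(task_key)
    if job_id is not None:
        status = system.status(job_id)
        if status is JobStatus.RUNNING:
            # Still active: reconnect and keep polling (polling omitted).
            return job_id, "reconnected"
        if status is JobStatus.SUCCEEDED:
            # Already succeeded: return the result immediately.
            return job_id, "returned_result"
        # Terminal failure: fall through and resubmit fresh.
    job_id = system.submit()
    state_store[task_key] = job_id  # persist the ID before polling starts
    return job_id, "submitted"


if __name__ == "__main__":
    system = FakeExternalSystem()
    store = {}
    print(execute(store, "my_task", system))  # first attempt submits
    print(execute(store, "my_task", system))  # retry after a crash reconnects
```

The point of the sketch is the ordering: the job ID is written to the store right after submission and before any polling, so a retry that starts after a worker crash finds the ID and checks the external system's view of the job before deciding whether to submit again.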


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
