Well done Amogh! Hoping to see implementations in other providers inspired by this :)
Shahar

On Tue, Jun 23, 2026, 20:17 Amogh Desai <[email protected]> wrote:

> Hi All,
>
> As we gear up towards Airflow 3.3, I am excited to announce that
> durable/crash-safe operator execution is coming in with this release.
>
> I just merged https://github.com/apache/airflow/pull/68623 which adds a
> *durable* flag to *ResumableJobMixin* in the Task SDK version 1.3.0, which
> will ship with Airflow 3.3. Wanted to share what it does and hear from the
> community.
>
> *What the mixin does*
> I have showcased this a couple of times in dev calls / Airflow town halls,
> but here's a short recap for a broader audience:
>
> Operators that submit a long-running job to an external system (Spark,
> YARN, Databricks, etc.) and poll for completion have a classic retry issue:
> the worker crashes during polling, and the retry submits a duplicate job
> because Airflow is not *external system aware*.
>
> The mixin fixes this by persisting the external job ID to
> *task_state_store*, which is built on AIP-103, before polling. On retry,
> it reads the ID back and reconnects to the running job instead of
> resubmitting. It handles three cases:
>
> * still active: reconnect and keep polling
> * already succeeded: return result immediately
> * terminal failure: resubmit fresh
>
> The SparkSubmitOperator has been ported over to be crash-safe across all
> its cluster modes (standalone, Spark on YARN, Spark on k8s) as an initial
> case study, because solving Spark means solving most other data
> engineering workloads.
>
> Related PRs:
>
> * Spark Standalone: https://github.com/apache/airflow/pull/67118
> * Spark on YARN: https://github.com/apache/airflow/pull/67473
> * Spark on k8s: https://github.com/apache/airflow/pull/68067
>
> *Looking for more patterns*
> If you maintain/work with an operator that fits the criteria below, or have
> hit the duplicate-job-on-retry problem, do share it here, or ping me on
> Slack and we can collaborate on migrating it to be crash-safe.
>
> Criteria for a good fit:
> * Operator submits to an external system and polls
> * The job has a stable tracking ID that survives the worker process
> * The external system keeps the job alive independently
>
> Looking forward to hearing from the community and excited to see what gets
> built on top of this.
>
> Thanks & Regards,
> Amogh Desai
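
For anyone sizing up whether their operator fits, the reconnect-on-retry pattern described in the quoted mail can be sketched roughly as below. This is only an illustration and not the actual ResumableJobMixin API: `FakeCluster`, `run_attempt`, `poll_until_done`, `crashing_poll`, and the plain dict standing in for task_state_store are all hypothetical names made up for this example.

```python
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class FakeCluster:
    """Hypothetical external system that keeps jobs alive independently of the worker."""

    def __init__(self):
        self.jobs = {}
        self.submitted = 0

    def submit(self):
        self.submitted += 1
        job_id = f"job-{self.submitted}"
        self.jobs[job_id] = JobStatus.RUNNING
        return job_id

    def status(self, job_id):
        return self.jobs[job_id]


def poll_until_done(cluster, job_id):
    """Stand-in for a polling loop: the job simply finishes on the first poll."""
    cluster.jobs[job_id] = JobStatus.SUCCEEDED
    return job_id


def crashing_poll(cluster, job_id):
    """Stand-in for a worker that dies while polling."""
    raise RuntimeError("worker crashed during polling")


def run_attempt(store, key, cluster, poll):
    """One task attempt: reuse a persisted job ID when possible, else submit."""
    job_id = store.get(key)
    if job_id is not None:
        status = cluster.status(job_id)
        if status is JobStatus.SUCCEEDED:
            return job_id  # already succeeded: return result immediately
        if status is JobStatus.RUNNING:
            return poll(cluster, job_id)  # still active: reconnect and keep polling
        # terminal failure: fall through and resubmit fresh
    job_id = cluster.submit()
    store[key] = job_id  # persist BEFORE polling so a crash cannot lose the ID
    return poll(cluster, job_id)


# The dict plays the role of the persistent store (an assumption for this sketch).
store, cluster = {}, FakeCluster()
try:
    run_attempt(store, "ti-1", cluster, crashing_poll)  # first try crashes mid-poll
except RuntimeError:
    pass
result = run_attempt(store, "ti-1", cluster, poll_until_done)  # retry reconnects
print(result, cluster.submitted)  # job-1 1
```

The point of the sketch is the ordering: the ID is written to the store before the first poll, so the retry finds it and the job is submitted only once.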
