Congrats! And thanks a lot for driving and landing this, Amogh and the AIP-103 team!
We have almost exactly the same implementation internally, and we are contributing back in the areas where we have already added the Resumable features, starting with TriggerDagRunOperator: https://github.com/apache/airflow/pull/68936

Best,
Stefan

> On Jun 24, 2026, at 1:13 AM, Pierre Jeambrun <[email protected]> wrote:
>
> Sounds like a big win, congrats!
>
> On Wed 24 Jun 2026 at 08:43, Shahar Epstein <[email protected]> wrote:
>
>> Well done Amogh!
>> Hoping to see implementations in other providers inspired by this :)
>>
>>
>> Shahar
>>
>> On Tue, Jun 23, 2026, 20:17 Amogh Desai <[email protected]> wrote:
>>
>>> Hi All,
>>>
>>> As we gear up towards Airflow 3.3, I am excited to announce that
>>> durable/crash-safe operator execution is coming in with Airflow 3.3.
>>>
>>> I just merged https://github.com/apache/airflow/pull/68623 which adds a
>>> *durable* flag to *ResumableJobMixin* in Task SDK version 1.3.0, which
>>> will ship with Airflow 3.3. Wanted to share what it does and hear from
>>> the community.
>>>
>>> *What the mixin does*
>>> I have showcased this a couple of times in dev calls / Airflow town halls,
>>> but here's a short recap for a broader audience:
>>>
>>> Operators that submit a long-running job to an external system (Spark,
>>> YARN, Databricks, etc.) and poll for completion have a classic retry
>>> issue: the worker crashes during polling, and the retry submits a
>>> duplicate job because Airflow is not *external-system aware*.
>>>
>>> The mixin fixes this by persisting the external job ID to
>>> *task_state_store*, which is built on AIP-103, before polling. On retry,
>>> it reads the ID back and reconnects to the running job instead of
>>> resubmitting.
>>> It is capable of handling three cases:
>>>
>>> * still active: reconnect and keep polling
>>> * already succeeded: return result immediately
>>> * terminal failure: resubmit fresh
>>>
>>> The SparkSubmitOperator has been ported over to be crash-safe across all
>>> its cluster modes (standalone, Spark on YARN, Spark on k8s) as an initial
>>> case study, because solving Spark means solving most other data
>>> engineering workloads. Related PRs:
>>>
>>> * Spark Standalone: https://github.com/apache/airflow/pull/67118
>>> * Spark on YARN: https://github.com/apache/airflow/pull/67473
>>> * Spark on k8s: https://github.com/apache/airflow/pull/68067
>>>
>>> *Looking for more patterns*
>>> If you maintain or work with an operator that fits the criteria below, or
>>> have hit the duplicate-job-on-retry problem, do share it here, or ping me
>>> on Slack and we can collaborate on migrating it to be crash-safe.
>>>
>>> Criteria for a good fit:
>>> * Operator submits to an external system and polls
>>> * The job has a stable tracking ID that survives the worker process
>>> * The external system keeps the job alive independently
>>>
>>> Looking forward to hearing from the community and excited to see what
>>> gets built on top of this.
>>>
>>> Thanks & Regards,
>>> Amogh Desai
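
For anyone skimming the thread, the reconnect-on-retry flow Amogh describes (persist the external job ID before polling, then handle the three cases on retry) can be sketched roughly as below. This is a toy illustration only and not the actual ResumableJobMixin API: `FakeExternalSystem`, `JobStatus`, `run_task`, and the plain dict standing in for task_state_store are all hypothetical names made up for this example.

```python
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class FakeExternalSystem:
    """Hypothetical stand-in for Spark/YARN/etc.: jobs outlive the worker."""

    def __init__(self):
        self.jobs = {}
        self._next_id = 0

    def submit(self):
        # Each submission creates a new job with a stable tracking ID.
        self._next_id += 1
        job_id = f"job-{self._next_id}"
        self.jobs[job_id] = JobStatus.RUNNING
        return job_id

    def status(self, job_id):
        return self.jobs[job_id]


def run_task(system, state_store, task_key):
    """Run one task attempt and return (job_id, action_taken).

    `state_store` is a plain dict here; it only imitates a store that
    survives worker crashes. A real operator would start polling after
    the "submitted" and "reconnected" outcomes.
    """
    job_id = state_store.get(task_key)
    if job_id is not None:
        status = system.status(job_id)
        if status is JobStatus.RUNNING:
            # Still active: reconnect instead of submitting a duplicate.
            return job_id, "reconnected"
        if status is JobStatus.SUCCEEDED:
            # Already succeeded: return the result immediately.
            return job_id, "returned_result"
        # Terminal failure: fall through and resubmit fresh.
    job_id = system.submit()
    # Persist the ID before any polling so a crash cannot lose it.
    state_store[task_key] = job_id
    return job_id, "submitted"
```

Calling `run_task` twice with the same `state_store` and `task_key` submits once and reconnects on the second call; flipping the fake job to `FAILED` makes the next attempt submit a fresh job.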
