Sounds like a big win, congrats!

On Wed 24 Jun 2026 at 08:43, Shahar Epstein <[email protected]> wrote:
> Well done Amogh!
> Hoping to see implementations in other providers inspired by this :)
>
>
> Shahar
>
> On Tue, Jun 23, 2026, 20:17 Amogh Desai <[email protected]> wrote:
>
> > Hi All,
> >
> > As we gear up towards Airflow 3.3, I am excited to announce that
> > durable/crash-safe operator execution is coming in with Airflow 3.3.
> >
> > I just merged https://github.com/apache/airflow/pull/68623 which adds a
> > *durable* flag to *ResumableJobMixin* in the task SDK version 1.3.0 which
> > will be launched with 3.3 airflow. Wanted to share what it does and hear
> > from the community.
> >
> > *What the mixin does*
> > I have showcased this a couple times in dev calls / airflow town halls but
> > here's a short recap for broader audience:
> >
> > Operators that submit a long running job to an external system (Spark,
> > YARN, Databricks, etc.) and poll for completion have a classic retry
> > issue: worker crashes during polling, retry submits a duplicate job
> > because airflow is not *external system aware.*
> >
> > The mixin fixes this by persisting the external job ID to
> > *task_state_store* before polling, which is built on AIP-103. On retry,
> > it reads the ID back and reconnects to the running job instead of
> > resubmitting. It is capable of handling three cases:
> >
> > * still active: reconnect and keep polling
> > * already succeeded: return result immediately
> > * terminal failure: resubmit fresh
> >
> > The SparkSubmitOperator has been ported over to be crash-safe across all
> > its cluster modes: standalone, spark on yarn, spark on k8s
> > as an initial case study because solving spark means solving most other
> > data engineering workloads.
> >
> > Related PRs:
> >
> > * Spark Standalone: https://github.com/apache/airflow/pull/67118
> > * Spark on yarn: https://github.com/apache/airflow/pull/67473
> > * Spark on k8s: https://github.com/apache/airflow/pull/68067
> >
> > *Looking for more patterns*
> > If you maintain/work with an operator that fits the below criteria, or
> > have hit the duplicate-job-on-retry problem, do share it here or you can
> > even ping me on slack and we can collaborate on migrating it to be
> > crash-safe.
> >
> > Criteria for a good fit:
> > * Operator submits to an external system and polls
> > * The job has a stable tracking ID that survives the worker process
> > * The external system keeps the job alive independently
> >
> > Looking forward to hearing from the community and excited to see what
> > gets built on top of this.
> >
> > Thanks & Regards,
> > Amogh Desai
> >
>
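
For anyone skimming the thread, the three-case recovery described in the quoted announcement can be sketched roughly as below. This is a standalone illustration of the pattern, not the actual ResumableJobMixin API: `FakeExternalSystem`, `JobStatus`, `run_task`, and the `poll` callback are all hypothetical names, and a plain dict stands in for the persisted task state store.

```python
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class FakeExternalSystem:
    """Hypothetical stand-in for Spark/YARN/Databricks: jobs outlive the worker."""

    def __init__(self):
        self.jobs = {}
        self._next_id = 0

    def submit(self):
        self._next_id += 1
        job_id = f"job-{self._next_id}"
        self.jobs[job_id] = JobStatus.RUNNING
        return job_id

    def status(self, job_id):
        return self.jobs[job_id]


def run_task(system, state_store, poll):
    """One task attempt.

    `state_store` is a dict standing in for storage that survives retries
    (an assumption for this sketch). `poll` blocks until the job finishes
    and may raise to simulate a worker crash.
    """
    job_id = state_store.get("job_id")
    if job_id is not None:
        status = system.status(job_id)
        if status is JobStatus.SUCCEEDED:
            # Already succeeded: return immediately, no polling needed.
            return job_id
        if status is JobStatus.FAILED:
            # Terminal failure: forget the old ID and resubmit fresh.
            job_id = None
        # Otherwise the job is still active: fall through and reconnect.
    if job_id is None:
        job_id = system.submit()
        # Persist the ID BEFORE polling so a crash cannot lose it.
        state_store["job_id"] = job_id
    poll(system, job_id)
    return job_id
```

The key design point is the ordering: the job ID is written to the store before polling starts, so a crash at any time during polling still leaves the retry enough information to reconnect instead of submitting a duplicate.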
