Hi All,

As we gear up for Airflow 3.3, I am excited to announce that durable (crash-safe) operator execution is coming with that release.

I just merged https://github.com/apache/airflow/pull/68623, which adds a *durable* flag to *ResumableJobMixin* in Task SDK 1.3.0, which will be released with Airflow 3.3. I wanted to share what it does and hear from the community.

*What the mixin does*

I have showcased this a couple of times in dev calls and Airflow town halls, but here is a short recap for a broader audience.

Operators that submit a long-running job to an external system (Spark, YARN, Databricks, etc.) and poll for completion have a classic retry issue: if the worker crashes during polling, the retry submits a duplicate job, because Airflow is not *external system aware*.

The mixin fixes this by persisting the external job ID to *task_state_store*, which is built on AIP-103, before polling starts. On retry, it reads the ID back and reconnects to the running job instead of resubmitting. It handles three cases:

* still active: reconnect and keep polling
* already succeeded: return the result immediately
* terminal failure: resubmit fresh

The SparkSubmitOperator has been ported to be crash-safe across all its cluster modes (standalone, Spark on YARN, Spark on k8s). I picked it as the initial case study because solving this for Spark covers the pattern behind most other data engineering workloads.

Related PRs:

* Spark standalone: https://github.com/apache/airflow/pull/67118
* Spark on YARN: https://github.com/apache/airflow/pull/67473
* Spark on k8s: https://github.com/apache/airflow/pull/68067

*Looking for more patterns*

If you maintain or work with an operator that fits the criteria below, or have hit the duplicate-job-on-retry problem, please share it here, or ping me on Slack and we can collaborate on migrating it to be crash-safe.

Criteria for a good fit:

* The operator submits to an external system and polls
* The job has a stable tracking ID that survives the worker process
* The external system keeps the job alive independently

Looking forward to hearing from the community, and excited to see what gets built on top of this.

Thanks & Regards,
Amogh Desai
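
P.S. For anyone who has not seen the demos, here is a rough, self-contained sketch of the reconnect-on-retry flow described above. It is an illustration only and not the actual *ResumableJobMixin* API: `FakeCluster`, `JobStatus`, `run_attempt`, and the `external_job_id` key are all hypothetical names, and a plain dict stands in for *task_state_store*.

```python
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class FakeCluster:
    """Hypothetical external system; jobs outlive the worker that submitted them."""

    def __init__(self):
        self.jobs = {}  # job_id -> JobStatus

    def submit(self):
        job_id = f"job-{len(self.jobs) + 1}"
        self.jobs[job_id] = JobStatus.RUNNING
        return job_id

    def status(self, job_id):
        return self.jobs[job_id]


def run_attempt(cluster, state_store, key="external_job_id"):
    """One task attempt: reconnect to a saved job if possible, else submit.

    `state_store` is a dict standing in for a durable store (assumption).
    Returns (action, job_id), where action names the path that was taken.
    """
    job_id = state_store.get(key)
    if job_id is not None:
        status = cluster.status(job_id)
        if status is JobStatus.RUNNING:
            return "reconnected", job_id  # still active: keep polling it
        if status is JobStatus.SUCCEEDED:
            return "already_succeeded", job_id  # return the result immediately
        # Terminal failure: fall through and resubmit fresh.
    job_id = cluster.submit()
    state_store[key] = job_id  # persist the ID before any polling starts
    return "submitted", job_id


cluster, store = FakeCluster(), {}
print(run_attempt(cluster, store))  # first attempt submits a new job
print(run_attempt(cluster, store))  # retry after a crash finds the saved ID
```

The point of the sketch is the ordering: the job ID is saved before polling begins, so a retry always has something to look up, and only a terminal failure leads to a second submission.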
