Hi All,

As we gear up towards Airflow 3.3, I am excited to announce that
durable/crash-safe operator execution is
coming with this release.

I just merged https://github.com/apache/airflow/pull/68623, which adds a
*durable* flag to *ResumableJobMixin* in
Task SDK 1.3.0, which will ship with Airflow 3.3. I wanted
to share what it does and hear from the
community.

*What the mixin does*
I have showcased this a couple of times in dev calls / Airflow town halls, but
here's a short recap for a broader audience:

Operators that submit a long-running job to an external system (Spark,
YARN, Databricks, etc.) and poll for completion
have a classic retry issue: if the worker crashes during polling, the retry
submits a duplicate job because Airflow is not *external system aware*.
The mixin fixes this by persisting the external job ID to *task_state_store*,
which is built on AIP-103, before polling starts. On retry,
it reads the ID back and reconnects to the running job instead of
resubmitting. It handles three cases:

* still active: reconnect and keep polling
* already succeeded: return result immediately
* terminal failure: resubmit fresh
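
To make the three cases concrete, here is a minimal Python sketch of the
decision logic. It does not use the real ResumableJobMixin or
task_state_store API: the dict-backed store, FakeExternalSystem, JobStatus
and start_or_reconnect are all hypothetical names made up for illustration,
and the actual implementation in the PR may look quite different.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class FakeExternalSystem:
    """Hypothetical stand-in for Spark/YARN/etc.

    Jobs live here independently of the worker process that submitted them.
    """

    jobs: dict = field(default_factory=dict)
    submitted: int = 0

    def submit(self) -> str:
        self.submitted += 1
        job_id = f"job-{self.submitted}"
        self.jobs[job_id] = JobStatus.RUNNING
        return job_id

    def status(self, job_id: str) -> JobStatus:
        return self.jobs[job_id]


def start_or_reconnect(store: dict, key: str, system: FakeExternalSystem):
    """Return (job_id, action) for a first attempt or a retry.

    `store` plays the role of a durable state store that survives a worker
    crash (an assumption; a plain dict is used here only for illustration).
    """
    job_id = store.get(key)
    if job_id is not None:
        status = system.status(job_id)
        if status is JobStatus.RUNNING:
            # Still active: reconnect and keep polling the same job.
            return job_id, "reconnect"
        if status is JobStatus.SUCCEEDED:
            # Already succeeded: return the result without resubmitting.
            return job_id, "return_result"
        # Terminal failure: fall through and resubmit fresh.
    job_id = system.submit()
    # Persist the external job ID before polling starts.
    store[key] = job_id
    return job_id, "submit"
```

On a first attempt the store is empty, so the job is submitted and its ID is
saved. A retry after a simulated crash finds the saved ID and only resubmits
when the external system reports a terminal failure.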

The SparkSubmitOperator has been ported to be crash-safe across all
its cluster modes (standalone, Spark on YARN, Spark on k8s)
as an initial case study, because solving Spark means solving most other
data engineering workloads. Related PRs:

* Spark Standalone: https://github.com/apache/airflow/pull/67118
* Spark on YARN: https://github.com/apache/airflow/pull/67473
* Spark on k8s: https://github.com/apache/airflow/pull/68067

*Looking for more patterns*
If you maintain or work with an operator that fits the criteria below, or have
hit the duplicate-job-on-retry problem, please share it here or
ping me on Slack, and we can collaborate on migrating it to be
crash-safe.

Criteria for a good fit:
* Operator submits to an external system and polls
* The job has a stable tracking ID that survives the worker process
* The external system keeps the job alive independently

Looking forward to hearing from the community and excited to see what gets
built on top of this.

Thanks & Regards,
Amogh Desai
