Congrats, Amogh! This is great to see.

I'm happy that I could help test the Kerberos auth part for Spark on YARN.

Best,
Aaron

On Wed, Jun 24, 2026 at 1:16 AM Amogh Desai <[email protected]> wrote:

> Hi All,
>
> As we gear up for Airflow 3.3, I am excited to announce that
> durable/crash-safe operator execution is coming with that release.
>
> I just merged https://github.com/apache/airflow/pull/68623 which adds a
> *durable* flag to *ResumableJobMixin* in Task SDK version 1.3.0, which
> will ship with Airflow 3.3. I wanted to share what it does and hear from
> the community.
>
> *What the mixin does*
> I have showcased this a couple of times in dev calls and Airflow town
> halls, but here is a short recap for a broader audience:
>
> Operators that submit a long-running job to an external system (Spark,
> YARN, Databricks, etc.) and poll for completion have a classic retry
> issue: if the worker crashes during polling, the retry submits a
> duplicate job because Airflow is not *external-system aware*.
>
> The mixin fixes this by persisting the external job ID to
> *task_state_store*, which is built on AIP-103, before polling starts. On
> retry, it reads the ID back and reconnects to the running job instead of
> resubmitting. It handles three cases:
>
> * still active: reconnect and keep polling
> * already succeeded: return result immediately
> * terminal failure: resubmit fresh
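
For anyone who wants to see the three cases above in code, here is a minimal
sketch. It does not use the real ResumableJobMixin or task_state_store API:
FakeExternalSystem, JobStatus, and decide_on_retry are hypothetical names, and
a plain dict stands in for the state store. It only illustrates the decision
logic as described in the recap.

```python
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class FakeExternalSystem:
    """Hypothetical stand-in for Spark/YARN/etc.: jobs outlive the worker."""

    def __init__(self):
        self.jobs = {}  # job_id -> JobStatus
        self.submit_count = 0

    def submit(self):
        self.submit_count += 1
        job_id = f"job-{self.submit_count}"
        self.jobs[job_id] = JobStatus.RUNNING
        return job_id

    def status(self, job_id):
        return self.jobs[job_id]


def decide_on_retry(state_store, system, key="external_job_id"):
    """Return (job_id, action) for one task attempt.

    state_store is a plain dict here (assumption); it models a store that
    survives a worker crash.
    """
    job_id = state_store.get(key)
    if job_id is not None:
        status = system.status(job_id)
        if status is JobStatus.RUNNING:
            return job_id, "reconnect"      # still active: keep polling
        if status is JobStatus.SUCCEEDED:
            return job_id, "return_result"  # already done: no resubmit
        # terminal failure: fall through and submit a fresh job
    job_id = system.submit()
    state_store[key] = job_id  # persist the ID before polling begins
    return job_id, "submit"


store, system = {}, FakeExternalSystem()
first = decide_on_retry(store, system)  # first attempt submits
retry = decide_on_retry(store, system)  # after a crash, the retry reconnects
```

The point of the sketch is that the second call finds the persisted ID and
does not call submit() again, so only one job exists on the external system.
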
>
> As an initial case study, the SparkSubmitOperator has been ported to be
> crash-safe across all its cluster modes (standalone, Spark on YARN, and
> Spark on K8s), because solving Spark means solving most other data
> engineering workloads. Related PRs:
>
> * Spark Standalone: https://github.com/apache/airflow/pull/67118
> * Spark on YARN: https://github.com/apache/airflow/pull/67473
> * Spark on K8s: https://github.com/apache/airflow/pull/68067
>
> *Looking for more patterns*
> If you maintain or work with an operator that fits the criteria below, or
> have hit the duplicate-job-on-retry problem, please share it here or ping
> me on Slack, and we can collaborate on migrating it to be crash-safe.
>
> Criteria for a good fit:
> * Operator submits to an external system and polls
> * The job has a stable tracking ID that survives the worker process
> * The external system keeps the job alive independently
>
> Looking forward to hearing from the community and excited to see what gets
> built on top of this.
>
> Thanks & Regards,
> Amogh Desai
>
