aalopatin opened a new pull request, #63922:
URL: https://github.com/apache/airflow/pull/63922

   Previously, when a SparkKubernetesOperator task failed because the 
SparkApplication CRD stayed in the SUBMITTED state longer than 
`startup_timeout_seconds`, the SparkApplication would remain in Kubernetes. 
This could lead to orphaned applications consuming resources, especially in 
clusters managed by YuniKorn.
   
   This PR ensures that the SparkApplication is deleted when a task fails 
because `startup_timeout_seconds` is exceeded. The fix adds a deletion step to 
the failure path of `start_spark_job`, logging a warning when the job is 
deleted because of the failure.
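   
   A minimal sketch of the cleanup pattern described above (not the actual 
provider source — `SparkLauncher`, `start_spark_job`, and `delete_spark_job` 
here are hypothetical stand-ins for the operator's launcher API):
   
   ```python
   # Sketch: delete the SparkApplication when startup times out, then
   # re-raise so the task still fails. All names are illustrative.
   import logging
   import time
   
   log = logging.getLogger(__name__)
   
   
   class StartupTimeoutError(Exception):
       """Raised when the SparkApplication stays in SUBMITTED too long."""
   
   
   class SparkLauncher:
       def __init__(self, startup_timeout_seconds: int = 600):
           self.startup_timeout_seconds = startup_timeout_seconds
           self.deleted = False
   
       def _current_state(self) -> str:
           # Placeholder: a real launcher would poll the SparkApplication CRD.
           return "SUBMITTED"
   
       def delete_spark_job(self) -> None:
           # Placeholder: a real launcher would call the Kubernetes API here.
           self.deleted = True
   
       def start_spark_job(self) -> None:
           deadline = time.monotonic() + self.startup_timeout_seconds
           try:
               while self._current_state() == "SUBMITTED":
                   if time.monotonic() >= deadline:
                       raise StartupTimeoutError(
                           "SparkApplication stayed in SUBMITTED past "
                           f"{self.startup_timeout_seconds}s"
                       )
                   time.sleep(1)
           except StartupTimeoutError:
               # The fix: clean up the orphaned SparkApplication on timeout,
               # logging a warning, before letting the failure propagate.
               log.warning("Deleting SparkApplication after startup timeout")
               self.delete_spark_job()
               raise
   ```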
   
   **Behavior before:**
   
   - Task fails after timeout
   - SparkApplication remains in Kubernetes
   
   **Behavior after:**
   
   - Task fails after timeout
   - SparkApplication is deleted from Kubernetes
   - Warning logged when the SparkApplication is deleted due to the failure
   
   This change addresses the scenario described in issue #63824 and improves 
reliability and resource cleanup for SparkKubernetesOperator tasks.
   
   **Notes:**
   
   - No behavioral changes occur when tasks succeed — normal cleanup remains 
unchanged.
   - Only applies to cases where the task fails due to exceeding `startup_timeout_seconds`.
   
   * closes: #63824 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
