jiangzho opened a new pull request, #514:
URL: https://github.com/apache/spark-kubernetes-operator/pull/514
### What changes were proposed in this pull request?
This PR implements granular restart control for Spark applications, with
support for tracking consecutive failures and scheduling failures.
### Why are the changes needed?
Consecutive failure tracking lets operators allow more total restarts for
apps with occasional transient failures, while stopping quickly on persistent
failures. It also applies special handling to scheduling failures to mitigate
API server stress.
This is an enhancement on top of the existing configuration for maximum
restart attempts and backoff interval.
### Does this PR introduce any user-facing change?
Yes. New optional configuration fields are available in RestartConfig:
```yaml
restartConfig:
  restartPolicy: Always
  maxRestartAttempts: 5                       # existing field
  restartBackoffMillis: 30000                 # existing field
  restartCounterResetMillis: 3600000          # existing field
  # New: consecutive failure limits
  maxRestartOnFailure: 3
  restartBackoffMillisForFailure: 60000
  maxRestartOnSchedulingFailure: 2
  restartBackoffMillisForSchedulingFailure: 300000
```
This change is backwards compatible: all new fields are optional. When the
failure-specific limits are not set, the operator uses maxRestartAttempts as
before.
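To illustrate the fallback rule described above, here is a minimal Python sketch of how the limit evaluation could work. The field names mirror the RestartConfig YAML (in snake_case); the `should_restart` helper and its exact precedence are illustrative assumptions, not the operator's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RestartConfig:
    # Existing fields, with the example defaults from the YAML above.
    max_restart_attempts: int = 5
    restart_backoff_millis: int = 30_000
    # New optional consecutive-failure limits (None means "not set",
    # in which case the global limits above govern).
    max_restart_on_failure: Optional[int] = None
    restart_backoff_millis_for_failure: Optional[int] = None
    max_restart_on_scheduling_failure: Optional[int] = None
    restart_backoff_millis_for_scheduling_failure: Optional[int] = None

def should_restart(cfg: RestartConfig, total_restarts: int,
                   consecutive_failures: int,
                   consecutive_scheduling_failures: int):
    """Return (restart_allowed, backoff_millis).

    Failure-specific limits apply only when set; otherwise the
    evaluation falls back to maxRestartAttempts / restartBackoffMillis.
    """
    if total_restarts >= cfg.max_restart_attempts:
        return False, 0
    if (cfg.max_restart_on_scheduling_failure is not None
            and consecutive_scheduling_failures
                >= cfg.max_restart_on_scheduling_failure):
        return False, 0
    if (cfg.max_restart_on_failure is not None
            and consecutive_failures >= cfg.max_restart_on_failure):
        return False, 0
    # Use the longest applicable backoff, so scheduling failures can
    # back off harder and reduce pressure on the API server.
    backoff = cfg.restart_backoff_millis
    if consecutive_failures > 0 and cfg.restart_backoff_millis_for_failure:
        backoff = max(backoff, cfg.restart_backoff_millis_for_failure)
    if (consecutive_scheduling_failures > 0
            and cfg.restart_backoff_millis_for_scheduling_failure):
        backoff = max(backoff, cfg.restart_backoff_millis_for_scheduling_failure)
    return True, backoff
```

For example, with `max_restart_on_failure=3` set, the third consecutive failure stops restarts even though the global `max_restart_attempts` budget is not exhausted; with no failure-specific fields set, behavior reduces to the pre-existing global limits.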
### How was this patch tested?
Unit tests were added to validate the limit evaluation flow, both with and
without the new fields.
### Was this patch authored or co-authored using generative AI tooling?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]