jiangzho opened a new pull request, #514:
URL: https://github.com/apache/spark-kubernetes-operator/pull/514
### What changes were proposed in this pull request?
This PR implements granular restart control for Spark applications, with
support for tracking consecutive failures and scheduling failures.
### Why are the changes needed?
Consecutive failure tracking lets operators allow more total restarts for
apps with occasional transient failures, while stopping quickly on persistent
failures. It also applies special handling to scheduling failures to mitigate
API server stress.
This is an enhancement on top of the existing configuration for maximum
restart attempts and backoff interval.
### Does this PR introduce any user-facing change?
Yes. New optional configuration fields are available in RestartConfig:
```yaml
restartConfig:
  restartPolicy: Always
  maxRestartAttempts: 5                       # existing field
  restartBackoffMillis: 30000                 # existing field
  restartCounterResetMillis: 3600000          # existing field
  # New: consecutive failure limits
  maxRestartOnFailure: 3
  restartBackoffMillisForFailure: 60000
  maxRestartOnSchedulingFailure: 2
  restartBackoffMillisForSchedulingFailure: 300000
```
This change is backwards compatible: all new fields are optional. When the
failure-specific limits are not set, the operator uses maxRestartAttempts as
before.
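To illustrate the fallback rule described above, here is a minimal Python sketch of how the limit evaluation could work. The field names mirror the RestartConfig YAML (in snake_case); the `should_restart` helper and its exact precedence are illustrative assumptions, not the operator's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RestartConfig:
    # Existing fields, with the example defaults from the YAML above.
    max_restart_attempts: int = 5
    restart_backoff_millis: int = 30_000
    # New optional consecutive-failure limits (None means "not set",
    # in which case the global limits above govern).
    max_restart_on_failure: Optional[int] = None
    restart_backoff_millis_for_failure: Optional[int] = None
    max_restart_on_scheduling_failure: Optional[int] = None
    restart_backoff_millis_for_scheduling_failure: Optional[int] = None

def should_restart(cfg: RestartConfig, total_restarts: int,
                   consecutive_failures: int,
                   consecutive_scheduling_failures: int):
    """Return (restart_allowed, backoff_millis).

    Failure-specific limits apply only when set; otherwise the
    evaluation falls back to maxRestartAttempts / restartBackoffMillis.
    """
    if total_restarts >= cfg.max_restart_attempts:
        return False, 0
    if (cfg.max_restart_on_scheduling_failure is not None
            and consecutive_scheduling_failures
                >= cfg.max_restart_on_scheduling_failure):
        return False, 0
    if (cfg.max_restart_on_failure is not None
            and consecutive_failures >= cfg.max_restart_on_failure):
        return False, 0
    # Use the longest applicable backoff, so scheduling failures can
    # back off harder and reduce pressure on the API server.
    backoff = cfg.restart_backoff_millis
    if consecutive_failures > 0 and cfg.restart_backoff_millis_for_failure:
        backoff = max(backoff, cfg.restart_backoff_millis_for_failure)
    if (consecutive_scheduling_failures > 0
            and cfg.restart_backoff_millis_for_scheduling_failure):
        backoff = max(backoff, cfg.restart_backoff_millis_for_scheduling_failure)
    return True, backoff
```

For example, with `max_restart_on_failure=3` set, the third consecutive failure stops restarts even though the global `max_restart_attempts` budget is not exhausted; with no failure-specific fields set, behavior reduces to the pre-existing global limits.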
### How was this patch tested?
Unit tests were added to validate the limit evaluation flow, both with and
without the new fields.
### Was this patch authored or co-authored using generative AI tooling?
No
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]