peter-toth commented on code in PR #514:
URL:
https://github.com/apache/spark-kubernetes-operator/pull/514#discussion_r2890712038
##########
docs/spark_custom_resources.md:
##########
@@ -234,14 +234,111 @@ restartConfig:
restartBackoffMillis: 30000
```
+### Granular Restart Control
+
+For more fine-grained control over restart behavior, you can configure different retry limits
+and backoff times for specific failure types. This allows you to handle different failure
+scenarios with appropriate strategies.
+
+The operator maintains multiple counters to track different types of restarts:
+- General restart counter: Tracks all restarts
+- Consecutive failure counter: Tracks consecutive failures
+- Consecutive scheduling failure counter: Tracks consecutive scheduling failures only
+
+#### Restart Behavior Control
+
+- Consecutive failure tracking: The failure-specific counters track consecutive failures
+  of the app, distinguishing between persistent failures (requiring intervention) and
+  transient issues (safe for retry).
+  - For example, with `restartPolicy=Always`, `maxRestartAttempts=5` and `maxRestartOnFailure=2`:
+    - The app is restarted after at most 2 consecutive failures and stops on the 3rd, with at most 5 restarts overall.
+    - In other words, the sequence F -> F -> F would stop.
+    - The sequence F -> S -> F -> S -> F would continue with the 5th restart, because the
+      successful attempts reset the failure counter.
+- Granular control over `SchedulingFailure`: similarly, you can set the maximum number of
+  restarts and the backoff interval for consecutive `SchedulingFailure` attempts, as these are
+  often associated with API server rejections, exceeded quotas, or resource constraints.
+
+#### Restart Limit Evaluation
+
+When an attempt ends, limits are checked in order:
+ 1. General limit (`maxRestartAttempts`) is checked for every restart
+ 2. For failures, the most specific applicable limit is also checked:
+    - Scheduling failures (`SchedulingFailure`) → `maxRestartOnSchedulingFailure` (if set)
+    - Other failures → `maxRestartOnFailure` (if set)
+ 3. The application stops if any applicable limit is exceeded
+
+
+#### Configuration Fields
+
+```yaml
+restartConfig:
+ restartPolicy: Always
+ # Default restart configuration (applies to all restarts)
+ maxRestartAttempts: 5
+ restartBackoffMillis: 30000 # 30 seconds
+
+ # Override for consecutive general failures (application crashes, driver failures, etc.)
+ # This counter resets to 0 on success
+ maxRestartOnFailure: 3
+ restartBackoffMillisForFailure: 60000 # 1 minute
+
+ # Override for consecutive scheduling failures
+ maxRestartOnSchedulingFailure: 1
+ restartBackoffMillisForSchedulingFailure: 300000 # 5 minutes
+```
+
+#### Example Use Cases
+
+Tolerate transient failures but stop on persistent issues:
+
+```yaml
+restartConfig:
+ restartPolicy: Always
+ maxRestartAttempts: 100 # Allow many total attempts
+ restartBackoffMillis: 30000
+ # But stop after 3 consecutive failures (indicates persistent problem)
+ maxRestartOnFailure: 3
+ restartBackoffMillisForFailure: 60000
+```
+
+Mitigate API server stress during scheduling failures:
+
+```yaml
+restartConfig:
+ restartPolicy: Always
+ maxRestartAttempts: 50
+ restartBackoffMillis: 30000
+ # Stop quickly on scheduling failures to avoid overwhelming API server
+ maxRestartOnSchedulingFailure: 2
+ restartBackoffMillisForSchedulingFailure: 600000 # 10 minutes
+```
+
+
+| Field | Type | Default Value | Description |
+|-------|------|---------------|-------------|
+| .spec.applicationTolerations.restartConfig.restartPolicy | string | Never | Restart policy: `Never`, `Always`, `OnFailure`, or `OnInfrastructureFailure` |
+| .spec.applicationTolerations.restartConfig.maxRestartAttempts | integer | 3 | Maximum number of restart attempts for all scenarios (always checked) |
+| .spec.applicationTolerations.restartConfig.restartBackoffMillis | integer | 30000 | Default backoff time in milliseconds between restart attempts |
+| .spec.applicationTolerations.restartConfig.maxRestartOnFailure | integer | null | Maximum consecutive failures before stopping. Resets to 0 on success. If null, uses maxRestartAttempts |
Review Comment:
Yeah, I see your point. These configs are essentially special-case overrides of
the default `maxRestartAttempts` / `restartBackoffMillis` configs, so leaving
them unset can mean something different from setting a special override value
(e.g. -1 for unlimited).
Yes, maybe 1 makes sense as well.
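
To make the default discussion concrete, here is a minimal Python sketch of how I read the "Restart Limit Evaluation" section quoted above. It is not the operator's implementation: `RestartConfig`, `Counters`, `allow_restart` and `restarts_granted` are hypothetical names, `restartPolicy=Always` is assumed, an unset (`None`) override is treated as "no extra limit beyond `maxRestartAttempts`", and a scheduling failure is assumed to fall back to `maxRestartOnFailure` when its own limit is unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartConfig:
    # Hypothetical mirror of the restartConfig fields discussed in the docs.
    max_restart_attempts: int = 3
    max_restart_on_failure: Optional[int] = None
    max_restart_on_scheduling_failure: Optional[int] = None


@dataclass
class Counters:
    restarts: int = 0
    consecutive_failures: int = 0
    consecutive_scheduling_failures: int = 0


def allow_restart(cfg: RestartConfig, c: Counters, outcome: str) -> bool:
    """Update counters for a finished attempt and decide whether to restart.

    outcome is "S" (success), "F" (failure) or "SF" (scheduling failure).
    """
    if outcome == "S":
        # A successful attempt resets both consecutive-failure counters.
        c.consecutive_failures = 0
        c.consecutive_scheduling_failures = 0
    else:
        c.consecutive_failures += 1
        if outcome == "SF":
            c.consecutive_scheduling_failures += 1
        else:
            c.consecutive_scheduling_failures = 0

    # 1. The general limit is checked for every restart.
    if c.restarts >= cfg.max_restart_attempts:
        return False

    # 2. For failures, only the most specific limit that is set is checked.
    if outcome == "SF" and cfg.max_restart_on_scheduling_failure is not None:
        if c.consecutive_scheduling_failures > cfg.max_restart_on_scheduling_failure:
            return False
    elif outcome != "S" and cfg.max_restart_on_failure is not None:
        if c.consecutive_failures > cfg.max_restart_on_failure:
            return False

    c.restarts += 1
    return True


def restarts_granted(cfg: RestartConfig, outcomes: list) -> int:
    """Replay a sequence of attempt outcomes and count the restarts granted."""
    c = Counters()
    for outcome in outcomes:
        if not allow_restart(cfg, c, outcome):
            break
    return c.restarts


# The docs' example: maxRestartAttempts=5, maxRestartOnFailure=2.
cfg = RestartConfig(max_restart_attempts=5, max_restart_on_failure=2)
print(restarts_granted(cfg, ["F", "F", "F"]))            # stops on the 3rd failure
print(restarts_granted(cfg, ["F", "S", "F", "S", "F"]))  # reaches the 5th restart
```

Under these assumptions, whether "unset" should mean "no extra limit", "same as `maxRestartAttempts`", or a sentinel such as -1 only changes the `is not None` checks, which is the point the default value in the table has to settle.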
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]