Re: [PR] [SPARK-55623] Add granular restart control with consecutive failure tracking [spark-kubernetes-operator]

via GitHub Thu, 05 Mar 2026 07:29:27 -0800


peter-toth commented on code in PR #514:
URL: https://github.com/apache/spark-kubernetes-operator/pull/514#discussion_r2890712038



##########
docs/spark_custom_resources.md:
##########
@@ -234,14 +234,111 @@ restartConfig:
   restartBackoffMillis: 30000
 ```
 
+### Granular Restart Control
+
+For more fine-grained control over restart behavior, you can configure different retry limits
+and backoff times for specific failure types. This allows you to handle different failure
+scenarios with appropriate strategies.
+
+The operator maintains multiple counters to track different types of restarts:
+- General restart counter: Tracks all restarts
+- Consecutive failure counter: Tracks consecutive failures
+- Consecutive scheduling failure counter: Tracks consecutive scheduling failures only
+
+#### Restart Behavior Control
+
+- Consecutive failure tracking: The failure-specific counters track consecutive failures
+  of the app, distinguishing between persistent failures (requiring intervention) and
+  transient issues (safe for retry).
+  - For Example: With `restartPolicy=Always`, `maxRestartAttempts=5` and `maxRestartOnFailure=2`:
+  - The app would tolerate at maximum of 3 consecutive failures, with maximal of 5 restarts
+  - In other words, sequence F -> F -> F would stop.
+  - sequence F -> S -> F -> S -> F would continue with the 5th restart as the succeeded attempts
+    reset the failure counter
+- Granular control over `SchedulingFailure`: similarly, it's possible to control the maximal
+  restart and backoff interval for consecutive `SchedulingFailure` attempts, as it can be highly
+  associated with API server rejections, quota exceeded, resource constraints.
+
+#### Restart Limit Evaluation
+
+When an attempt ends, limits are checked in order:
+  1. General limit (`maxRestartAttempts`) is checked for every restart
+  2. For failures, the most specific applicable limit is also checked:
+     - Scheduling failures (SchedulingFailure) → `maxRestartOnSchedulingFailure` (if set)
+     - Other failures → `maxRestartOnFailure` (if set)
+  3. The application stops if any applicable limit is exceeded
+
+
+#### Configuration Fields
+
+```yaml
+restartConfig:
+  restartPolicy: Always
+  # Default restart configuration (applies to all restarts)
+  maxRestartAttempts: 5
+  restartBackoffMillis: 30000  # 30 seconds
+
+  # Override for consecutive general failures (application crashes, driver failures, etc.)
+  # This counter resets to 0 on success
+  maxRestartOnFailure: 3
+  restartBackoffMillisForFailure: 60000  # 1 minute
+
+  # Override for consecutive scheduling failures
+  maxRestartOnSchedulingFailure: 1
+  restartBackoffMillisForSchedulingFailure: 300000  # 5 minutes
+```
+
+#### Example Use Cases
+
+Tolerate transient failures but stop on persistent issues:
+
+```yaml
+restartConfig:
+  restartPolicy: Always
+  maxRestartAttempts: 100  # Allow many total attempts
+  restartBackoffMillis: 30000
+  # But stop after 3 consecutive failures (indicates persistent problem)
+  maxRestartOnFailure: 3
+  restartBackoffMillisForFailure: 60000
+```
+
+Mitigate API server stress during scheduling failures:
+
+```yaml
+restartConfig:
+  restartPolicy: Always
+  maxRestartAttempts: 50
+  restartBackoffMillis: 30000
+  # Stop quickly on scheduling failures to avoid overwhelming API server
+  maxRestartOnSchedulingFailure: 2
+  restartBackoffMillisForSchedulingFailure: 600000  # 10 minutes
+```
+
+
+| Field | Type | Default Value | Description |
+|-------|------|---------------|-------------|
+| .spec.applicationTolerations.restartConfig.restartPolicy | string | Never | Restart policy: `Never`, `Always`, `OnFailure`, or `OnInfrastructureFailure` |
+| .spec.applicationTolerations.restartConfig.maxRestartAttempts | integer | 3 | Maximum number of restart attempts for all scenarios (always checked) |
+| .spec.applicationTolerations.restartConfig.restartBackoffMillis | integer | 30000 | Default backoff time in milliseconds between restart attempts |
+| .spec.applicationTolerations.restartConfig.maxRestartOnFailure | integer | null | Maximum consecutive failures before stopping. Resets to 0 on success. If null, uses maxRestartAttempts |

Review Comment:
   Yeah, I see your point. These configs are kind of overrides for the default 
`maxRestartAttempts` / `restartBackoffMillis` configs in a special case, and as 
such, leaving them unset (no override) can mean something different from 
setting a special override value (e.g. -1 for unlimited).
   
   Yes, maybe 1 makes sense as well.
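
To make the semantics under discussion concrete, here is a minimal sketch of the limit evaluation as the quoted doc describes it. It is not the operator's actual code: the function name `restarts_granted`, the `"F"`/`"S"` outcome encoding, and the simplification to `restartPolicy=Always` with no scheduling-failure counter are all assumptions made for illustration. An unset `maxRestartOnFailure` is modeled as `None`, meaning only the general limit applies.

```python
from typing import Iterable, Optional


def restarts_granted(
    max_restart_attempts: int,
    max_restart_on_failure: Optional[int],
    outcomes: Iterable[str],
) -> int:
    """Count restarts granted for a sequence of attempt outcomes.

    Hypothetical model, assuming restartPolicy=Always:
    "F" is a failed attempt, "S" is a succeeded attempt.
    """
    restarts = 0
    consecutive_failures = 0
    for outcome in outcomes:
        # A success resets the consecutive failure counter.
        if outcome == "F":
            consecutive_failures += 1
        else:
            consecutive_failures = 0

        # 1. The general limit is checked for every restart.
        if restarts >= max_restart_attempts:
            break

        # 2. The failure-specific limit is checked only if it is set.
        if (
            outcome == "F"
            and max_restart_on_failure is not None
            and consecutive_failures > max_restart_on_failure
        ):
            break

        restarts += 1
    return restarts
```

Under these assumptions, `restarts_granted(5, 2, "FFF")` stops at the third consecutive failure, while `restarts_granted(5, 2, "FSFSF")` still grants the 5th restart, matching the example in the doc. With `None` as the override, only `maxRestartAttempts` bounds the sequence.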



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

