jiangzho opened a new pull request, #341:
URL: https://github.com/apache/spark-kubernetes-operator/pull/341

   ### What changes were proposed in this pull request?
   
   This PR adds exception handling for status recorder when persisting status. 
If encountered exceptions in updating app status, reconciler would finish 
current reconcile loop and requeue a new one instead of blindly throw the 
exception.
   
   ### Why are the changes needed?
   
   SparkAppReconciler is not handling exceptions when updating app status - 
these failed reconcile loops may end up with endless retry if the status update 
is caused by conflicts.
   
   For example, we observe exceptions like these when app is starting
   
   ```
   Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure 
executing: PUT at: 
https://kind-control-plane.vsl:6443/apis/spark.apache.org/v1/namespaces/default/sparkapplications/spark-example-retain-duration/status.
 Message: Operation cannot be fulfilled on sparkapplications.spark.apache.org 
"spark-example-retain-duration": the object has been modified; please apply 
your changes to the latest version and try again. Received status: 
Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], 
group=spark.apache.org, kind=sparkapplications, 
name=spark-example-retain-duration, retryAfterSeconds=null, uid=null, 
additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on 
sparkapplications.spark.apache.org "spark-example-retain-duration": the object 
has been modified; please apply your changes to the latest version and try 
again, metadata=ListMeta(_continue=null, remainingItemCount=null, 
resourceVersion=null, selfLink=null, additionalProperties={}),
  reason=Conflict, status=Failure, additionalProperties={}).
   at 
io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:642)
 ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
   at 
io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:622)
 ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
   at 
io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:582)
 ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
   at 
io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:549)
 ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
   at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:646)
 ~[?:?]
   at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) 
~[?:?]
   at 
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) 
~[?:?]
   at 
io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:141)
 ~[spark-kubernetes-operator-0.5.0-SNAPSHOT-all.jar:?]
   ```
   
   Why is this happening ? Because reconcile can be triggered again by driver 
pod status update while another reconcile is in-progress. Without proper 
exception handling, this would keep recurring, taking unnecessary CPU time and 
confusing user in logs.
   
   Another corner case that would be fixed by this patch: current `AppInitStep` 
would mark app as `SchedulingFailure` even if the resources are requested as 
expected and only the status update fails. This patch handles exceptions for 
resource creation and status update separately. 
   
   We'd better digest this better: if an exception is thrown while updating app 
status (which is typically at the last of each reconcile) - operator shall 
properly finish this reconcile loop and start a new one. App status is fetched 
from cache at the beginning of each reconcile - and our reconcile steps are 
ready designed to be idempotent.
   
   ### Does this PR introduce any user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Tested from two perspectives:
   
   1. CIs - also added new unit test to validate update idempotency in init step
   2. Check logs : with rthis patch, no more exceptions like above is thrown 
while running apps
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to