Hi. I asked this before but never got an answer. My observation is that the operator, after collecting some stats, tries to restart one of the deployments. This includes taking a savepoint (`takeSavepointOnUpgrade: true`, `upgradeMode: savepoint`) and “gracefully” shutting down the JobManager by “scaling it to zero” (setting replicas = 0 in the newly generated config).
However, the deployment never comes back up, apparently because of the following exception:

2024-04-25 17:20:52,920 mi.j.o.p.e.ReconciliationDispatcher [ERROR][flink/f-d7681d0f-c093-5d8a-b5f5-2b66b4547bf6] Error during error status handling.
org.apache.flink.kubernetes.operator.exception.StatusConflictException: Status have been modified externally in version 50607043 Previous: {"jobStatus":{"jobName":"autoscaling test:attack-surface","jobId":"be93ad9b152c1f11696e971e6a638b63","state":"FINISHEDINFO\\n\"},\"mode\":\"native\"},\"resource_metadata\":…
    at org.apache.flink.kubernetes.operator.utils.StatusRecorder.replaceStatus(StatusRecorder.java:161)
    at org.apache.flink.kubernetes.operator.utils.StatusRecorder.patchAndCacheStatus(StatusRecorder.java:97)
    at org.apache.flink.kubernetes.operator.reconciler.ReconciliationUtils.toErrorStatusUpdateControl(ReconciliationUtils.java:438)
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.updateErrorStatus(FlinkDeploymentController.java:209)
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.updateErrorStatus(FlinkDeploymentController.java:57)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleErrorStatusHandler(ReconciliationDispatcher.java:194)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:123)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:91)
    at io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:64)
    at io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor.run(EventProcessor.java:417)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

2024-04-25 17:20:52,925 mi.j.o.p.e.ReconciliationDispatcher [ERROR][flink/f-d7681d0f-c093-5d8a-b5f5-2b66b4547bf6] Error during event processing ExecutionScope{ resource id: ResourceID{name='f-d7681d0f-c093-5d8a-b5f5-2b66b4547bf6', namespace='flink'}, version: 50606957} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.kubernetes.operator.exception.StatusConflictException: Status have been modified externally in version 50607043 Previous: {"jobStatus":{"jobName":"autoscaling test:attack-surface","jobId":"be93ad9b152c1f11696e971e6a638b63","state":"FINISHED",
Caused by: org.apache.flink.kubernetes.operator.exception.StatusConflictException: Status have been modified externally in version 50607043 Previous: {"jobStatus":{"jobName":"autoscaling test:attack-surface","jobId":"be93ad9b152c1f11696e971e6a638b63","state":"FINISHED
    at org.apache.flink.kubernetes.operator.utils.StatusRecorder.replaceStatus(StatusRecorder.java:161)
    at org.apache.flink.kubernetes.operator.utils.StatusRecorder.patchAndCacheStatus(StatusRecorder.java:97)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:175)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:63)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:279)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:156)
    at org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:171)
    at org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:145)
    ... 13 more

How can I fix this? Why does the deployment not come back up after this exception? Is there a configuration property to set the number of retries?

Thanks,
Maxim