[ https://issues.apache.org/jira/browse/FLINK-34576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824324#comment-17824324 ]
chenyuzhi commented on FLINK-34576:
-----------------------------------

Analyzing the operator log above: when the operator switches leader, the old leader does eventually exit, but it does not know that it is no longer the leader until its process finally terminates. During that window it updates the FlinkDeployment status concurrently with the new leader, resulting in a status conflict.

Referring to the [JOSDK 4.4.4 source code|https://github.com/operator-framework/java-operator-sdk/blob/6238ef21d6761fb99731fd4903c077ad10258b64/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/LeaderElectionManager.java#L79], before version 4.5 JOSDK did not expose any notification mechanism to tell the application instance that it was no longer the leader; the startLeader/stopLeader logic was kept internal. Since version 4.5, JOSDK provides a callback mechanism to notify the application whether it currently holds leadership, which is exactly the scenario discussed in the [issue|https://github.com/operator-framework/java-operator-sdk/issues/2009] above. However, using it requires some adaptation on the application side, such as maintaining a leader flag variable. For reference, Flink itself uses the fabric8 leader-election callback mechanism (backed by a ConfigMap) for Kubernetes HA, see the [Flink k8s HA source code|https://github.com/apache/flink/blob/9b1375520b6b351df7551d85fcecd920e553cc3a/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L116].

For now, we plan to tune the two parameters leaseDuration/renewDeadline, e.g. enlarging the difference (leaseDuration - renewDeadline) (default 15s - 10s = 5s) so that the old leader has more time to exit before a new leader is elected. This can alleviate the current update-conflict problem, but it cannot completely solve it.
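The "leader flag" adaptation described above could be sketched roughly as follows. This is a minimal, stdlib-only illustration: the method names {{onStartLeading}}/{{onStopLeading}} and {{tryPatchStatus}} are hypothetical placeholders for whatever hooks the leader-election framework (JOSDK >= 4.5 callbacks, or fabric8's leader elector) actually invokes, not the real API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of guarding status writes with a leader flag.
// Callback names are hypothetical, not the JOSDK/fabric8 API.
class LeaderGuard {
    private final AtomicBoolean isLeader = new AtomicBoolean(false);

    // Called by the election framework when leadership is acquired.
    void onStartLeading() { isLeader.set(true); }

    // Called when leadership is lost, e.g. the lease was taken over by
    // another operator replica. From this point on, this instance must
    // stop writing FlinkDeployment status.
    void onStopLeading() { isLeader.set(false); }

    // Guard every status update: once told it lost the lease, the old
    // leader drops the write instead of conflicting with the new leader.
    boolean tryPatchStatus(Runnable patchStatus) {
        if (!isLeader.get()) {
            return false; // no longer leader: skip the update
        }
        patchStatus.run();
        return true;
    }

    public static void main(String[] args) {
        LeaderGuard guard = new LeaderGuard();
        guard.onStartLeading();
        System.out.println(guard.tryPatchStatus(() -> {})); // true while leader
        guard.onStopLeading();
        System.out.println(guard.tryPatchStatus(() -> {})); // false after losing the lease
    }
}
```

Note this flag only narrows the race window; a write already in flight when leadership is lost can still conflict, which is why the lease-duration tuning below remains only a mitigation.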
> Flink deployment keep staying at RECONCILING/STABLE status
> ----------------------------------------------------------
>
> Key: FLINK-34576
> URL: https://issues.apache.org/jira/browse/FLINK-34576
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.1
> Reporter: chenyuzhi
> Priority: Major
> Attachments: image-2024-03-05-15-13-11-032.png
>
> The HA mode of flink-kubernetes-operator is being used. When one of the pods of flink-kubernetes-operator restarts, flink-kubernetes-operator switches the leader. However, some flinkdeployments have been in the *JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.
> Through the cmd "kubectl describe flinkdeployment xxx", can see the following error, but there are no exceptions in the flink-kubernetes-operator log.
>
> {code:java}
> Status:
>   Cluster Info:
>     Flink - Revision: b6d20ed @ 2023-12-20T10:01:39+01:00
>     Flink - Version: 1.14.0-GDC1.6.0
>     Total - Cpu: 7.0
>     Total - Memory: 30064771072
>   Error: {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
>   Job Manager Deployment Status: READY
>   Job Status:
>     Job Id: cf44b5e73a1f263dd7d9f2c82be5216d
>     Job Name: noah_stream_studio_1754211682_2218100380
>     Savepoint Info:
>       Last Periodic Savepoint Timestamp: 0
>       Savepoint History:
>     Start Time: 1705635107137
>     State: RECONCILING
>     Update Time: 1709272530741
>   Lifecycle State: STABLE
> {code}
> !image-2024-03-05-15-13-11-032.png!
>
> version:
> flink-kubernetes-operator: 1.6.1
> flink: 1.14.0/1.15.2 (flinkdeployment 1200+)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)