[ https://issues.apache.org/jira/browse/FLINK-34576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824324#comment-17824324 ]
chenyuzhi commented on FLINK-34576:
-----------------------------------

Analyzing the operator log above: when the operator switches leader, the old leader does eventually exit, but it does not know that it is no longer the leader until its process finally terminates. During that window it updates the FlinkDeployment status concurrently with the new leader, resulting in a status conflict.

Referring to the [JOSDK 4.4.4 source code|https://github.com/operator-framework/java-operator-sdk/blob/6238ef21d6761fb99731fd4903c077ad10258b64/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/LeaderElectionManager.java#L79], before version 4.5 JOSDK did not expose any notification mechanism to tell the application instance that it was no longer the leader; the startLeader/stopLeader logic was kept internal. Since version 4.5, JOSDK provides a callback mechanism to notify the application whether it currently holds leadership, which is exactly the scenario discussed in the [issue|https://github.com/operator-framework/java-operator-sdk/issues/2009] above. However, using it requires some adaptation on the application side, such as maintaining a leader flag variable. For reference, Flink itself uses the fabric8 leader-election callback mechanism (backed by a ConfigMap) for Kubernetes HA, see the [Flink k8s HA source code|https://github.com/apache/flink/blob/9b1375520b6b351df7551d85fcecd920e553cc3a/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/resources/KubernetesLeaderElector.java#L116].

For now, we plan to tune the two parameters leaseDuration/renewDeadline, e.g. enlarging the difference (leaseDuration - renewDeadline) (default 15s - 10s = 5s) so that the old leader has more time to exit before a new leader is elected. This can alleviate the current update-conflict problem, but it cannot completely solve it.
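The "leader flag" adaptation described above could be sketched roughly as follows. This is a minimal, stdlib-only illustration: the method names {{onStartLeading}}/{{onStopLeading}} and {{tryPatchStatus}} are hypothetical placeholders for whatever hooks the leader-election framework (JOSDK >= 4.5 callbacks, or fabric8's leader elector) actually invokes, not the real API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of guarding status writes with a leader flag.
// Callback names are hypothetical, not the JOSDK/fabric8 API.
class LeaderGuard {
    private final AtomicBoolean isLeader = new AtomicBoolean(false);

    // Called by the election framework when leadership is acquired.
    void onStartLeading() { isLeader.set(true); }

    // Called when leadership is lost, e.g. the lease was taken over by
    // another operator replica. From this point on, this instance must
    // stop writing FlinkDeployment status.
    void onStopLeading() { isLeader.set(false); }

    // Guard every status update: once told it lost the lease, the old
    // leader drops the write instead of conflicting with the new leader.
    boolean tryPatchStatus(Runnable patchStatus) {
        if (!isLeader.get()) {
            return false; // no longer leader: skip the update
        }
        patchStatus.run();
        return true;
    }

    public static void main(String[] args) {
        LeaderGuard guard = new LeaderGuard();
        guard.onStartLeading();
        System.out.println(guard.tryPatchStatus(() -> {})); // true while leader
        guard.onStopLeading();
        System.out.println(guard.tryPatchStatus(() -> {})); // false after losing the lease
    }
}
```

Note this flag only narrows the race window; a write already in flight when leadership is lost can still conflict, which is why the lease-duration tuning below remains only a mitigation.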
> Flink deployment keep staying at RECONCILING/STABLE status
> ----------------------------------------------------------
>
> Key: FLINK-34576
> URL: https://issues.apache.org/jira/browse/FLINK-34576
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.1
> Reporter: chenyuzhi
> Priority: Major
> Attachments: image-2024-03-05-15-13-11-032.png
>
> The HA mode of flink-kubernetes-operator is being used. When one of the pods of flink-kubernetes-operator restarts, flink-kubernetes-operator switches the leader. However, some flinkdeployments have been in the *JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.
> Through the cmd "kubectl describe flinkdeployment xxx", can see the following error, but there are no exceptions in the flink-kubernetes-operator log.
>
> {code:java}
> Status:
>   Cluster Info:
>     Flink - Revision: b6d20ed @ 2023-12-20T10:01:39+01:00
>     Flink - Version: 1.14.0-GDC1.6.0
>     Total - Cpu: 7.0
>     Total - Memory: 30064771072
>   Error: {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
>   Job Manager Deployment Status: READY
>   Job Status:
>     Job Id: cf44b5e73a1f263dd7d9f2c82be5216d
>     Job Name: noah_stream_studio_1754211682_2218100380
>     Savepoint Info:
>       Last Periodic Savepoint Timestamp: 0
>       Savepoint History:
>     Start Time: 1705635107137
>     State: RECONCILING
>     Update Time: 1709272530741
>   Lifecycle State: STABLE
> {code}
> !image-2024-03-05-15-13-11-032.png!
>
> version:
> flink-kubernetes-operator: 1.6.1
> flink: 1.14.0/1.15.2 (flinkdeployment 1200+)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)