[ https://issues.apache.org/jira/browse/FLINK-34576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
chenyuzhi updated FLINK-34576:
------------------------------
    Description: 
The flink-kubernetes-operator is running in HA mode. When one of the flink-kubernetes-operator pods restarts, the operator switches leader. However, some flinkdeployments have remained in the *JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time. Running "kubectl describe flinkdeployment xxx" shows the following error, but there are no exceptions in the flink-kubernetes-operator log.

{code:java}
Status:
  Cluster Info:
    Flink - Revision:  b6d20ed @ 2023-12-20T10:01:39+01:00
    Flink - Version:   1.14.0-GDC1.6.0
    Total - Cpu:       7.0
    Total - Memory:    30064771072
  Error:  {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
  Job Manager Deployment Status:  READY
  Job Status:
    Job Id:    cf44b5e73a1f263dd7d9f2c82be5216d
    Job Name:  noah_stream_studio_1754211682_2218100380
    Savepoint Info:
      Last Periodic Savepoint Timestamp:  0
      Savepoint History:
    Start Time:   1705635107137
    State:        RECONCILING
    Update Time:  1709272530741
  Lifecycle State:  STABLE
{code}

!image-2024-03-05-15-13-11-032.png!

version:
flink-kubernetes-operator: 1.6.1
flink: 1.14.0/1.15.2

Job scale:
flinkdeployment 1200+

[~gyfora]

  was:
The flink-kubernetes-operator is running in HA mode. When one of the flink-kubernetes-operator pods restarts, the operator switches leader.
However, some flinkdeployments have remained in the *JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time. Running "kubectl describe flinkdeployment xxx" shows the following error, but there are no exceptions in the flink-kubernetes-operator log.

{code:java}
Status:
  Cluster Info:
    Flink - Revision:  b6d20ed @ 2023-12-20T10:01:39+01:00
    Flink - Version:   1.14.0-GDC1.6.0
    Total - Cpu:       7.0
    Total - Memory:    30064771072
  Error:  {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
  Job Manager Deployment Status:  READY
  Job Status:
    Job Id:    cf44b5e73a1f263dd7d9f2c82be5216d
    Job Name:  noah_stream_studio_1754211682_2218100380
    Savepoint Info:
      Last Periodic Savepoint Timestamp:  0
      Savepoint History:
    Start Time:   1705635107137
    State:        RECONCILING
    Update Time:  1709272530741
  Lifecycle State:  STABLE
{code}

!image-2024-03-05-15-13-11-032.png!

Version:
flink-kubernetes-operator: 1.6.1
flink: 1.14.0/1.15.2

[~gyfora]


> Flink deployment keep staying at RECONCILING/STABLE status
> ----------------------------------------------------------
>
>                 Key: FLINK-34576
>                 URL: https://issues.apache.org/jira/browse/FLINK-34576
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.1
>            Reporter: chenyuzhi
>            Priority: Major
>         Attachments: image-2024-03-05-15-13-11-032.png
>
>
> The flink-kubernetes-operator is running in HA mode.
> When one of the flink-kubernetes-operator pods restarts, the operator
> switches leader. However, some flinkdeployments have remained in the
> *JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.
> Running "kubectl describe flinkdeployment xxx" shows the following error,
> but there are no exceptions in the flink-kubernetes-operator log.
>
> {code:java}
> Status:
>   Cluster Info:
>     Flink - Revision:  b6d20ed @ 2023-12-20T10:01:39+01:00
>     Flink - Version:   1.14.0-GDC1.6.0
>     Total - Cpu:       7.0
>     Total - Memory:    30064771072
>   Error:  {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
>   Job Manager Deployment Status:  READY
>   Job Status:
>     Job Id:    cf44b5e73a1f263dd7d9f2c82be5216d
>     Job Name:  noah_stream_studio_1754211682_2218100380
>     Savepoint Info:
>       Last Periodic Savepoint Timestamp:  0
>       Savepoint History:
>     Start Time:   1705635107137
>     State:        RECONCILING
>     Update Time:  1709272530741
>   Lifecycle State:  STABLE
> {code}
>
> !image-2024-03-05-15-13-11-032.png!
>
> version:
> flink-kubernetes-operator: 1.6.1
> flink: 1.14.0/1.15.2
>
> Job scale:
> flinkdeployment 1200+
>
> [~gyfora]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)