[ https://issues.apache.org/jira/browse/FLINK-34576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823663#comment-17823663 ]
chenyuzhi commented on FLINK-34576:
-----------------------------------

Thanks for the reply.

1. Is there a way to somehow repro this on a smaller case?

I tried to simulate a leader switch by deleting the pod in the test environment, but could not reproduce the problem. In the production environment it is very likely to occur (maybe it is related to the load?). There may be a way to make the operator pod lose leadership without deleting the pod, but I haven't found one.

2. Have you tried operator version 1.7.0? We may have fixed the issue there already

We have not upgraded to 1.7.0 because that version no longer supports Flink 1.14.0, which our production environment still uses. Are you referring to this [JOSDK issue|https://github.com/operator-framework/java-operator-sdk/issues/2056]? We did encounter a split-brain problem similar to multiple leaders earlier, but as mentioned under the first question, this status exception still occurs after the leader has switched successfully (confirmed in the logs: the old leader exits and the new leader takes over).

3. Does it also affect newer Flink versions as well?

Our highest Flink version is 1.15.2, so the impact on higher versions is unknown.

4. Can you share some relevant operator logs?

Sure.

Operator A log during the leader switch (the "Stopped leading" message appears), taken from the log file:
{code:java}
2024-03-05 04:35:46,565 o.a.f.c.Configuration [WARN ][gdc-qdata-bu/prod-s1-monitor-reward-sjuneizhandoushemenrel] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
2024-03-05 04:35:46,567 o.a.f.c.Configuration [WARN ][gdc-gdc-sa/logstream-panama-panama-h73na-serverlog-produ] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
2024-03-05 04:35:46,569 o.a.f.c.Configuration [WARN ][gdc-gdc-sa/logstream-erie-erie-gzailab-sym2-ns-imageveri] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
2024-03-05 04:35:46,569 o.a.f.c.Configuration [WARN ][gdc-gdc-sa/test-vk-log3] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
2024-03-05 04:35:46,574 i.j.o.LeaderElectionManager [INFO ] New leader with identity:
2024-03-05 04:35:46,584 o.a.f.c.Configuration [WARN ][gdc-cld-bu/logdistribution-grand-cld-dnode-contianer-ba3] Config uses deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,584 o.a.f.c.Configuration [WARN ][gdc-cld-bu/logdistribution-grand-cld-dnode-contianer-ba3] Config uses deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,586 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO ][gdc-cld-bu/logdistribution-grand-cld-dnode-contianer-ba3] Resource fully reconciled, nothing to do...
2024-03-05 04:35:46,586 i.j.o.LeaderElectionManager [INFO ] Stopped leading for identity: flink-kubernetes-operator-85f6994468-cpsx9. Exiting.
2024-03-05 04:35:46,589 o.a.f.k.o.l.AuditUtils [INFO ][gdc-gdc-bu/test-lag-202306-v2-copy-cpu] >>> Status | Error | STABLE | {"type":"org.apache.flink.kubernetes.operator.exception.RecoveryFailureException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.","additionalMetadata":{},"throwableList":[]}
2024-03-05 04:35:46,591 o.a.f.c.Configuration [WARN ][gdc-a29-bu/logdistribution-xia-xia-a29-pc-vm-log-product] Config uses deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,591 o.a.f.c.Configuration [WARN ][gdc-a29-bu/logdistribution-xia-xia-a29-pc-vm-log-product] Config uses deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN ][gdc-gdc-sa/logstream-grand-grand-s8-serverlog-production] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN ][gdc-gdc-sa/logstream-jinghang-jinghang-g106-seazyi-nginx] Config uses deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN ][gdc-gdc-sa/logstream-jinghang-jinghang-g106-seazyi-nginx] Config uses deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN ][gdc-gdc-sa/logstream-jinghang-jinghang-artct-outer-p4-se] Config uses deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN ][gdc-gdc-sa/logstream-jinghang-jinghang-artct-outer-p4-se] Config uses deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration [WARN ][gdc-qdata-bu/prod-s1-monitor-reward-sjuneizhandoushemenrel] Config uses deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,593 o.a.f.c.Configuration [WARN ][gdc-qdata-bu/prod-s1-monitor-reward-sjuneizhandoushemenrel] Config uses deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,593 o.a.f.c.Configuration [WARN ][gdc-gdc-sa/logstream-panama-panama-h73na-serverlog-produ] Config uses deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,593 o.a.f.c.Configuration [WARN ][gdc-gdc-sa/logstream-panama-panama-h73na-serverlog-produ] Config uses deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 'kubernetes.taskmanager.cpu.amount'
{code}

Operator B log during the switch, taken from ES (the format differs slightly from the log file above):
{code:java}
-- Meters ---------------------------------------------------------------------
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.NumPerSecond: 0.35
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpRequest.NumPerSecond: 0.35
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.201.NumPerSecond: 0.0
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.200.NumPerSecond: 0.3333333333333333
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.101.NumPerSecond: 0.016666666666666666
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpRequest.Failed.NumPerSecond: 0.0
-- Histograms ---------------------------------------------------------------------
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.TimeNanos: count=1000, min=939944, max=49957558, mean=1717875.3819999998, stddev=2272368.0974273267, p50=1293964.5, p75=1475059.25, p95=3561530.249999989, p98=6726813.320000002, p99=8400899.390000004, p999=4.9932472127003446E7
=========================== Finished metrics report ==========================="
2024-03-04T20:35:48.416Z,"2024-03-05 04:35:48,027 INFO io.javaoperatorsdk.operator.LeaderElectionManager - New leader with identity: "
2024-03-04T20:35:48.416Z,"2024-03-05 04:35:48,121 INFO io.javaoperatorsdk.operator.LeaderElectionManager - New leader with identity: flink-kubernetes-operator-85f6994468-92xsz "
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,126 INFO io.javaoperatorsdk.operator.processing.Controller - Started event processing for controller: flinksessionjobcontroller "
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:49,150 WARN org.apache.flink.configuration.Configuration [gdc-gdc-sa/logstream-wei-ma65-production] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:49,150 WARN org.apache.flink.configuration.Configuration [gdc-nsh-bu/logdistribution-kiel-kiel-nsh-lhall-eos-produ] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,905 WARN org.apache.flink.configuration.Configuration [gdc-nsh-bu/logdistribution-kiel-kiel-nsh-lhall-eos-produ] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,905 WARN org.apache.flink.configuration.Configuration [gdc-a29-bu/logdistribution-tang-tang-a29-zycenter-hub-pr] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:49,150 WARN org.apache.flink.configuration.Configuration [gdc-g117-bu/logdistribution-welland-welland-g117-serverlo] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,901 WARN org.apache.flink.configuration.Configuration [gdc-gdc-sa/logstream-jinghang-jinghang-opd-java-fs-log-p] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,913 WARN org.apache.flink.configuration.Configuration [gdc-g117-bu/logdistribution-welland-welland-g117-serverlo] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,921 WARN org.apache.flink.configuration.Configuration [gdc-qdata-bu/prod-g17-reward-dynamic-huodongchangzhuhuodon] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:49,150 WARN org.apache.flink.configuration.Configuration [gdc-qdata-bu/prod-g48-monitor-reward-xinzengdaojujiankong] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:48,920 WARN org.apache.flink.configuration.Configuration [gdc-qdata-bu/prod-g17-reward-dynamic-reward-huodongchangzh] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:49,150 WARN org.apache.flink.configuration.Configuration [gdc-gdc-sa/logstream-jinghang-jinghang-opd-java-fs-log-p] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:48,919 WARN org.apache.flink.configuration.Configuration [gdc-gdc-sa/logstream-panama-panama-h72-hexfps-proxima-pr] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type' "
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:49,150 WARN org.apache.flink.configuration.Configuration [gdc-qdata-bu/prod-g17-reward-dynamic-reward-huodongchangzh] - Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
{code}

> Flink deployment keep staying at RECONCILING/STABLE status
> ----------------------------------------------------------
>
> Key: FLINK-34576
> URL: https://issues.apache.org/jira/browse/FLINK-34576
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.1
> Reporter: chenyuzhi
> Priority: Major
> Attachments: image-2024-03-05-15-13-11-032.png
>
>
> The HA mode of flink-kubernetes-operator is being used. When one of the
> flink-kubernetes-operator pods restarts, flink-kubernetes-operator switches the
> leader. However, some flinkdeployments then stay in the
> *JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.
> Running "kubectl describe flinkdeployment xxx" shows the following
> error, but there are no exceptions in the flink-kubernetes-operator log.
>
> {code:java}
> Status:
>   Cluster Info:
>     Flink - Revision: b6d20ed @ 2023-12-20T10:01:39+01:00
>     Flink - Version: 1.14.0-GDC1.6.0
>     Total - Cpu: 7.0
>     Total - Memory: 30064771072
>   Error: {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException: Failed to load configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed to load configuration","additionalMetadata":{}}]}
>   Job Manager Deployment Status: READY
>   Job Status:
>     Job Id: cf44b5e73a1f263dd7d9f2c82be5216d
>     Job Name: noah_stream_studio_1754211682_2218100380
>     Savepoint Info:
>       Last Periodic Savepoint Timestamp: 0
>       Savepoint History:
>     Start Time: 1705635107137
>     State: RECONCILING
>     Update Time: 1709272530741
>   Lifecycle State: STABLE
> {code}
>
> !image-2024-03-05-15-13-11-032.png!
>
> version:
> flink-kubernetes-operator: 1.6.1
> flink: 1.14.0/1.15.2 (flinkdeployment 1200+)
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)