[jira] [Commented] (FLINK-34576) Flink deployment keep staying at RECONCILING/STABLE status

chenyuzhi (Jira) Tue, 05 Mar 2024 07:08:06 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-34576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823663#comment-17823663
 ]


chenyuzhi commented on FLINK-34576:
-----------------------------------

Thanks for the reply.

1.  Is there a way to somehow repro this on a smaller case?

I tried to simulate a leader switch by deleting the operator pod in the test 
environment, but could not reproduce the issue. In the production environment 
it is very likely to occur (maybe it is related to the load?).

 

There may be a way to make the operator pod lose leadership without deleting 
the pod, which might reproduce the issue, but I have not found one.


2. Have you tried operator version 1.7.0? We may have fixed the issue there 
already

We have not upgraded to 1.7.0 because that version no longer supports 
Flink 1.14.0, which our production environment is still using.

 
Are you referring to this [JOSDK 
issue|https://github.com/operator-framework/java-operator-sdk/issues/2056]? We 
did encounter a split-brain problem with multiple leaders earlier, but as 
mentioned in the answer to the first question, this status exception still 
occurs after the leader has switched successfully (confirmed in the logs: the 
old leader exits and the new leader takes over).
 
3. Does it also affect newer Flink versions as well?
 

Our highest Flink version is 1.15.2, so we do not know whether newer versions 
are affected.
 

4. Can you share some relevant operator logs?

Sure.

 
Operator A log at the time of the leader switch (the "Stopped leading" message 
appears), taken from the log file:

 
{code:java}
2024-03-05 04:35:46,565 o.a.f.c.Configuration          [WARN 
][gdc-qdata-bu/prod-s1-monitor-reward-sjuneizhandoushemenrel] Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
2024-03-05 04:35:46,567 o.a.f.c.Configuration          [WARN 
][gdc-gdc-sa/logstream-panama-panama-h73na-serverlog-produ] Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
2024-03-05 04:35:46,569 o.a.f.c.Configuration          [WARN 
][gdc-gdc-sa/logstream-erie-erie-gzailab-sym2-ns-imageveri] Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
2024-03-05 04:35:46,569 o.a.f.c.Configuration          [WARN 
][gdc-gdc-sa/test-vk-log3] Config uses deprecated configuration key 
'high-availability' instead of proper key 'high-availability.type'
2024-03-05 04:35:46,574 i.j.o.LeaderElectionManager    [INFO ] New leader with 
identity: 
2024-03-05 04:35:46,584 o.a.f.c.Configuration          [WARN 
][gdc-cld-bu/logdistribution-grand-cld-dnode-contianer-ba3] Config uses 
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,584 o.a.f.c.Configuration          [WARN 
][gdc-cld-bu/logdistribution-grand-cld-dnode-contianer-ba3] Config uses 
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 
'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,586 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO 
][gdc-cld-bu/logdistribution-grand-cld-dnode-contianer-ba3] Resource fully 
reconciled, nothing to do...
2024-03-05 04:35:46,586 i.j.o.LeaderElectionManager    [INFO ] Stopped leading 
for identity: flink-kubernetes-operator-85f6994468-cpsx9. Exiting.
2024-03-05 04:35:46,589 o.a.f.k.o.l.AuditUtils         [INFO 
][gdc-gdc-bu/test-lag-202306-v2-copy-cpu] >>> Status | Error   | STABLE         
 | 
{"type":"org.apache.flink.kubernetes.operator.exception.RecoveryFailureException","message":"HA
 metadata not available to restore from last state. It is possible that the job 
has finished or terminally failed, or the configmaps have been deleted. Manual 
restore required.","additionalMetadata":{},"throwableList":[]} 
2024-03-05 04:35:46,591 o.a.f.c.Configuration          [WARN 
][gdc-a29-bu/logdistribution-xia-xia-a29-pc-vm-log-product] Config uses 
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,591 o.a.f.c.Configuration          [WARN 
][gdc-a29-bu/logdistribution-xia-xia-a29-pc-vm-log-product] Config uses 
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 
'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration          [WARN 
][gdc-gdc-sa/logstream-grand-grand-s8-serverlog-production] Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
2024-03-05 04:35:46,592 o.a.f.c.Configuration          [WARN 
][gdc-gdc-sa/logstream-jinghang-jinghang-g106-seazyi-nginx] Config uses 
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration          [WARN 
][gdc-gdc-sa/logstream-jinghang-jinghang-g106-seazyi-nginx] Config uses 
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 
'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration          [WARN 
][gdc-gdc-sa/logstream-jinghang-jinghang-artct-outer-p4-se] Config uses 
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration          [WARN 
][gdc-gdc-sa/logstream-jinghang-jinghang-artct-outer-p4-se] Config uses 
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 
'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,592 o.a.f.c.Configuration          [WARN 
][gdc-qdata-bu/prod-s1-monitor-reward-sjuneizhandoushemenrel] Config uses 
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,593 o.a.f.c.Configuration          [WARN 
][gdc-qdata-bu/prod-s1-monitor-reward-sjuneizhandoushemenrel] Config uses 
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 
'kubernetes.taskmanager.cpu.amount'
2024-03-05 04:35:46,593 o.a.f.c.Configuration          [WARN 
][gdc-gdc-sa/logstream-panama-panama-h73na-serverlog-produ] Config uses 
deprecated configuration key 'kubernetes.jobmanager.cpu' instead of proper key 
'kubernetes.jobmanager.cpu.amount'
2024-03-05 04:35:46,593 o.a.f.c.Configuration          [WARN 
][gdc-gdc-sa/logstream-panama-panama-h73na-serverlog-produ] Config uses 
deprecated configuration key 'kubernetes.taskmanager.cpu' instead of proper key 
'kubernetes.taskmanager.cpu.amount' {code}
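
The leader-election lines above are easy to miss among the deprecation 
warnings. A minimal sketch for filtering them out, assuming the log is 
hard-wrapped as in this mail and every record starts with a timestamp; the 
function names are only illustrative:

```python
import re

# A record starts with a timestamp such as "2024-03-05 04:35:46,565".
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")
# Substring shared by all the deprecation warnings in the excerpt.
NOISE = "Config uses deprecated configuration key"


def join_records(lines):
    """Re-join hard-wrapped lines into one string per log record."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) or not records:
            # New record (or the very first line of the input).
            records.append(line)
        else:
            # Continuation line: the wrap happened at a space, so restore one.
            records[-1] = records[-1].rstrip() + " " + line.strip()
    return records


def drop_noise(records):
    """Keep only the records that are not deprecation warnings."""
    return [r for r in records if NOISE not in r]


# Usage (the path is hypothetical):
#   with open("operator-a.log") as f:
#       for record in drop_noise(join_records(f)):
#           print(record)
```

On the excerpt above this should leave only the LeaderElectionManager, 
AbstractFlinkResourceReconciler and AuditUtils records.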
 
Operator B log at the time of the switch, taken from ES (the format differs 
slightly from the log file above):
 
{code:java}
-- Meters ---------------------------------------------------------------------
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.NumPerSecond:
 0.35
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpRequest.NumPerSecond:
 0.35
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.201.NumPerSecond:
 0.0
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.200.NumPerSecond:
 0.3333333333333333
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.101.NumPerSecond:
 0.016666666666666666
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpRequest.Failed.NumPerSecond:
 0.0
-- Histograms ---------------------------------------------------------------------
flink-kubernetes-operator-85f6994468-92xsz.k8soperator.streamfly.flink-kubernetes-operator.system.KubeClient.HttpResponse.TimeNanos:
 count=1000, min=939944, max=49957558, mean=1717875.3819999998, 
stddev=2272368.0974273267, p50=1293964.5, p75=1475059.25, 
p95=3561530.249999989, p98=6726813.320000002, p99=8400899.390000004, 
p999=4.9932472127003446E7
=========================== Finished metrics report ==========================="
2024-03-04T20:35:48.416Z,"2024-03-05 04:35:48,027 INFO  
io.javaoperatorsdk.operator.LeaderElectionManager             - New leader with 
identity: 
"
2024-03-04T20:35:48.416Z,"2024-03-05 04:35:48,121 INFO  
io.javaoperatorsdk.operator.LeaderElectionManager             - New leader with 
identity: flink-kubernetes-operator-85f6994468-92xsz
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,126 INFO  
io.javaoperatorsdk.operator.processing.Controller             - Started event 
processing for controller: flinksessionjobcontroller
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:49,150 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-gdc-sa/logstream-wei-ma65-production] - Config uses deprecated 
configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:49,150 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-nsh-bu/logdistribution-kiel-kiel-nsh-lhall-eos-produ] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,905 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-nsh-bu/logdistribution-kiel-kiel-nsh-lhall-eos-produ] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,905 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-a29-bu/logdistribution-tang-tang-a29-zycenter-hub-pr] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:49,150 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-g117-bu/logdistribution-welland-welland-g117-serverlo] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,901 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-gdc-sa/logstream-jinghang-jinghang-opd-java-fs-log-p] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,913 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-g117-bu/logdistribution-welland-welland-g117-serverlo] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.420Z,"2024-03-05 04:35:48,921 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-qdata-bu/prod-g17-reward-dynamic-huodongchangzhuhuodon] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:49,150 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-qdata-bu/prod-g48-monitor-reward-xinzengdaojujiankong] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:48,920 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-qdata-bu/prod-g17-reward-dynamic-reward-huodongchangzh] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:49,150 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-gdc-sa/logstream-jinghang-jinghang-opd-java-fs-log-p] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:48,919 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-gdc-sa/logstream-panama-panama-h72-hexfps-proxima-pr] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type'
"
2024-03-04T20:35:49.421Z,"2024-03-05 04:35:49,150 WARN  
org.apache.flink.configuration.Configuration                 
[gdc-qdata-bu/prod-g17-reward-dynamic-reward-huodongchangzh] - Config uses 
deprecated configuration key 'high-availability' instead of proper key 
'high-availability.type' {code}
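
To line up the two excerpts, the timestamps can be put on one time axis. The 
sketch below assumes the operator pods log in UTC+8, which is only inferred 
from the ES export, where the ingest time 2024-03-04T20:35:48.416Z sits next 
to the log time 2024-03-05 04:35:48,027. The helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: operator log timestamps are local time in UTC+8.
LOCAL_TZ = timezone(timedelta(hours=8))


def parse_operator_ts(text):
    """Parse an operator log timestamp like '2024-03-05 04:35:46,586'."""
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")
    return naive.replace(tzinfo=LOCAL_TZ)


def parse_es_ts(text):
    """Parse an ES ingest timestamp like '2024-03-04T20:35:48.416Z' (UTC)."""
    naive = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return naive.replace(tzinfo=timezone.utc)


# Operator A: "Stopped leading for identity: ...-cpsx9"
stopped = parse_operator_ts("2024-03-05 04:35:46,586")
# Operator B: "New leader with identity: ...-92xsz"
elected = parse_operator_ts("2024-03-05 04:35:48,121")

gap = (elected - stopped).total_seconds()
print(f"time between A stopping and B leading: {gap:.3f} s")
```

Under that assumption, operator B reports itself as leader about 1.5 seconds 
after operator A stops leading.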
 

 

> Flink deployment keep staying at RECONCILING/STABLE status
> ----------------------------------------------------------
>
>                 Key: FLINK-34576
>                 URL: https://issues.apache.org/jira/browse/FLINK-34576
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.1
>            Reporter: chenyuzhi
>            Priority: Major
>         Attachments: image-2024-03-05-15-13-11-032.png
>
>
> The HA mode of flink-kubernetes-operator is being used. When one of the pods 
> of flink-kubernetes-operator restarts, flink-kubernetes-operator switches the 
> leader. However, some flinkdeployments have been in the 
> *JOB_STATUS=RECONCILING&LIFECYCLE_STATE=STABLE* state for a long time.
> Running "kubectl describe flinkdeployment xxx" shows the following error, 
> but there are no exceptions in the flink-kubernetes-operator log.
>  
> {code:java}
> Status:
>   Cluster Info:
>     Flink - Revision:             b6d20ed @ 2023-12-20T10:01:39+01:00
>     Flink - Version:              1.14.0-GDC1.6.0
>     Total - Cpu:                  7.0
>     Total - Memory:               30064771072
>   Error:                          
> {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException:
>  java.lang.RuntimeException: Failed to load 
> configuration","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.shaded.guava30.com.google.common.util.concurrent.UncheckedExecutionException","message":"java.lang.RuntimeException:
>  Failed to load 
> configuration","additionalMetadata":{}},{"type":"java.lang.RuntimeException","message":"Failed
>  to load configuration","additionalMetadata":{}}]}
>   Job Manager Deployment Status:  READY
>   Job Status:
>     Job Id:    cf44b5e73a1f263dd7d9f2c82be5216d
>     Job Name:  noah_stream_studio_1754211682_2218100380
>     Savepoint Info:
>       Last Periodic Savepoint Timestamp:  0
>       Savepoint History:
>     Start Time:     1705635107137
>     State:          RECONCILING
>     Update Time:    1709272530741
>   Lifecycle State:  STABLE {code}
>  
> !image-2024-03-05-15-13-11-032.png!
>  
> Versions:
> flink-kubernetes-operator: 1.6.1
> flink: 1.14.0/1.15.2 (flinkdeployment 1200+)
>  
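
With 1200+ flinkdeployments, checking each one with "kubectl describe" is 
slow. A minimal sketch for listing the resources stuck in 
RECONCILING/STABLE, assuming the JSON status keys are jobStatus.state, 
lifecycleState and error (inferred from the describe output quoted above); 
the function names are illustrative:

```python
import json
import subprocess


def stuck_deployments(deployment_list):
    """Return (namespace, name, error) for resources in RECONCILING/STABLE."""
    hits = []
    for item in deployment_list.get("items", []):
        status = item.get("status", {})
        job_state = status.get("jobStatus", {}).get("state")
        lifecycle = status.get("lifecycleState")
        if job_state == "RECONCILING" and lifecycle == "STABLE":
            meta = item["metadata"]
            hits.append((meta["namespace"], meta["name"], status.get("error", "")))
    return hits


def fetch_deployments():
    """Fetch all FlinkDeployment resources as parsed JSON via kubectl."""
    out = subprocess.run(
        ["kubectl", "get", "flinkdeployment", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


# Usage, on a machine with kubectl access to the cluster:
#   for namespace, name, error in stuck_deployments(fetch_deployments()):
#       print(namespace, name, error[:120])
```

Running this before and after a leader switch would show which resources 
entered the state because of the switch.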



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
