Hello! I constantly get a similar error when the operator (running as a single instance) receives deployment statuses. The details are described in this message: https://lists.apache.org/thread/0odcc9pvlpz1x9y2nop9dlmcnp9v1696 I tried changing versions and allocated resources, as well as the number of reconcile threads, but nothing helped.
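For reference, the reconcile-thread count mentioned above is an operator-level option. A minimal, untested sketch of where it is typically set, assuming the operator Helm chart's defaultConfiguration block; the key name is taken from the flink-kubernetes-operator configuration docs, and the value is only an illustration:

defaultConfiguration:
  create: true
  append: true
  flink-conf.yaml: |
    # Number of FlinkDeployments reconciled in parallel (illustrative value).
    kubernetes.operator.reconcile.parallelism: 30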
________________________________
From: Tony Chen <tony.ch...@robinhood.com>
Sent: November 3, 2023, 9:13:51
To: user@flink.apache.org
Cc: Nathan Moderwell
Subject: Re: flink-kubernetes-operator cannot handle SPECCHANGE for 100+ FlinkDeployments concurrently

One of the operator pods logged the following exception before the container restarted:

2023-11-01 14:24:21,260 o.a.f.s.n.i.n.c.AbstractChannel [WARN ] Force-closing a channel whose registration task was not accepted by an event loop: [id: 0x1a7718c1]
java.util.concurrent.RejectedExecutionException: event executor terminated

I did notice that all 3 of our operator pods were reconciling FlinkDeployments, and this is definitely an issue. After I churned 2 of the pods, there was only 1 pod left as the leader, and this operator pod was able to reconcile SPECCHANGEs of FlinkDeployments again.

Are there any recommendations on how I can enforce only 1 pod to be the leader? For example, would increasing the lease-duration help? (A hedged configuration sketch is appended at the end of this thread.)
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/

On Wed, Nov 1, 2023 at 11:16 PM Tony Chen <tony.ch...@robinhood.com> wrote:

Hi Flink Community,

I am currently running flink-kubernetes-operator 1.6-patched (https://github.com/apache/flink-kubernetes-operator/commit/3f0dc2ee5534084bc162e6deaded36e93bb5e384), and I have 3 flink-kubernetes-operator pods running. Recently, I deployed around 110 new FlinkDeployments, and I had no issues with this initial deployment. However, when I applied changes to all 110 of these FlinkDeployments concurrently to update their container image, the flink-kubernetes-operator pods seemed to be in constant conflict with each other.

For example, before the SPECCHANGE, FlinkDeployment rh-flinkdeployment-01 would be RUNNING (status.jobStatus.state) and STABLE (status.lifecycleState). After the FlinkDeployment spec is updated, rh-flinkdeployment-01 goes through FINISHED (status.jobStatus.state) and UPGRADING (status.lifecycleState), and then RECONCILING (status.jobStatus.state) and DEPLOYED (status.lifecycleState). It reaches RUNNING and STABLE again, but then for some reason it goes back to FINISHED and UPGRADING, and I notice that the newly created jobmanager pod gets deleted and then recreated. rh-flinkdeployment-01 basically becomes stuck in this loop where it becomes stable and then gets re-deployed by the operator. This doesn't happen to all 110 FlinkDeployments, but it happens to around 30 of them concurrently.
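The two status fields referenced above sit side by side on the FlinkDeployment resource; an illustrative fragment (not a complete status), with values matching the transitions described:

status:
  jobStatus:
    state: RUNNING        # observed cycling through FINISHED -> RECONCILING -> RUNNING
  lifecycleState: STABLE  # observed cycling through UPGRADING -> DEPLOYED -> STABLE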
I have pasted some logs from one of the operator pods on one of the FlinkDeployments. I have also highlighted messages that seem suspicious to me. I will try to gather more logs and send them tomorrow. For now, to mitigate this, I had to delete all of these FlinkDeployments and run them with the deprecated GoogleCloudPlatform operator. I'm hoping to resolve this soon so that I don't have to run anything on the GoogleCloudPlatform operator anymore.

Thanks!
Tony

2023-11-02 05:26:02,132 i.j.o.p.e.ReconciliationDispatcher [ERROR][<namespace>/<flinkdeployment>] Error during event processing ExecutionScope{ resource id: ResourceID{name='<flinkdeployment>', namespace='<namespace>'}, version: 17772349729} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.kubernetes.operator.exception.StatusConflictException: Status have been modified externally in version 17772349851 Previous: <REDACTED>
...
2023-11-02 05:27:25,945 o.a.f.k.o.o.d.ApplicationObserver [WARN ][<namespace>/<flinkdeployment>] Running deployment generation -1 doesn't match upgrade target generation 2.
2023-11-02 05:27:25,946 o.a.f.c.Configuration [WARN ][<namespace>/<flinkdeployment>] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
2023-11-02 05:27:26,034 o.a.f.k.o.l.AuditUtils [INFO ][<namespace>/<flinkdeployment>] >>> Status | Info | UPGRADING | The resource is being upgraded
2023-11-02 05:27:26,057 o.a.f.k.o.l.AuditUtils [INFO ][<namespace>/<flinkdeployment>] >>> Event | Info | SUBMIT | Starting deployment
2023-11-02 05:27:26,057 o.a.f.k.o.s.AbstractFlinkService [INFO ][<namespace>/<flinkdeployment>] Deploying application cluster requiring last-state from HA metadata
2023-11-02 05:27:26,057 o.a.f.c.Configuration [WARN ][<namespace>/<flinkdeployment>] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
2023-11-02 05:27:26,084 o.a.f.c.Configuration [WARN ][<namespace>/<flinkdeployment>] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
2023-11-02 05:27:26,110 o.a.f.k.o.s.NativeFlinkService [INFO ][<namespace>/<flinkdeployment>] Deploying application cluster
2023-11-02 05:27:26,110 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO ][<namespace>/<flinkdeployment>] Submitting application in 'Application Mode'.
2023-11-02 05:27:26,112 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][<namespace>/<flinkdeployment>] The derived from fraction jvm overhead memory (1.000gb (1073741840 bytes)) is greater than its max value 1024.000mb (1073741824 bytes), max value will be used instead
2023-11-02 05:27:26,112 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][<namespace>/<flinkdeployment>] The derived from fraction jvm overhead memory (1.000gb (1073741840 bytes)) is greater than its max value 1024.000mb (1073741824 bytes), max value will be used instead
2023-11-02 05:27:26,163 o.a.f.k.o.s.AbstractFlinkService [INFO ][<namespace>/<flinkdeployment>] Waiting for cluster shutdown... (30s)
2023-11-02 05:27:26,193 o.a.f.k.o.l.AuditUtils [INFO ][<namespace>/<flinkdeployment>] >>> Event | Warning | CLUSTERDEPLOYMENTEXCEPTION | The Flink cluster <flinkdeployment> already exists.
2023-11-02 05:27:26,193 o.a.f.k.o.r.ReconciliationUtils [WARN ][<namespace>/<flinkdeployment>] Attempt count: 0, last attempt: false
2023-11-02 05:27:26,277 o.a.f.k.o.l.AuditUtils [INFO ][<namespace>/<flinkdeployment>] >>> Status | Error | UPGRADING | {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster <flinkdeployment> already exists.","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"The Flink cluster <flinkdeployment> already exists.","additionalMetadata":{}}]}

--
Tony Chen
Software Engineer
Menlo Park, CA
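Regarding the leader-election question above: the relevant settings live in the operator configuration (the page linked in the first reply). A minimal, untested sketch, assuming they are passed through the Helm chart's defaultConfiguration flink-conf.yaml block; the key names come from the operator docs, and the values are only illustrative assumptions:

# Hedged sketch of an operator flink-conf.yaml fragment, not a tested setup.
kubernetes.operator.leader-election.enabled: true
# The lease name is an assumed example; all operator replicas must use the same one.
kubernetes.operator.leader-election.lease-name: flink-operator-lease
# Illustrative values; renew-deadline must stay below lease-duration.
kubernetes.operator.leader-election.lease-duration: 60s
kubernetes.operator.leader-election.renew-deadline: 30s

Alternatively, running the operator with a single replica (replicas: 1 in the Helm values) avoids leader contention altogether, at the cost of operator availability.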