One of the operator pods logged the following exception before the container restarted:
2023-11-01 14:24:21,260 o.a.f.s.n.i.n.c.AbstractChannel [WARN ] Force-closing a channel whose registration task was not accepted by an event loop: [id: 0x1a7718c1] java.util.concurrent.RejectedExecutionException: event executor terminated

I also noticed that all 3 of our operator pods were reconciling FlinkDeployments, and this definitely is an issue. After I churned 2 of the pods, only 1 pod remained as the leader, and that operator pod was able to reconcile SPECCHANGE events of FlinkDeployments again.

Are there any recommendations on how I can enforce that only 1 pod is the leader? For example, would increasing the lease-duration help?

https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/configuration/

On Wed, Nov 1, 2023 at 11:16 PM Tony Chen <tony.ch...@robinhood.com> wrote:

> Hi Flink Community,
>
> I am currently running flink-kubernetes-operator 1.6-patched
> (https://github.com/apache/flink-kubernetes-operator/commit/3f0dc2ee5534084bc162e6deaded36e93bb5e384),
> and I have 3 flink-kubernetes-operator pods running. Recently, I deployed
> around 110 new FlinkDeployments, and I had no issues with this initial
> deployment. However, when I applied changes to all 110 of these
> FlinkDeployments concurrently to update their container image, the
> flink-kubernetes-operator pods seemed to be in constant conflict with each
> other.
>
> For example, before the SPECCHANGE, FlinkDeployment rh-flinkdeployment-01
> would be RUNNING (status.jobStatus.state) and STABLE
> (status.lifecycleState). After the FlinkDeployment spec is updated,
> rh-flinkdeployment-01 goes through FINISHED (status.jobStatus.state) and
> UPGRADING (status.lifecycleState), and then RECONCILING
> (status.jobStatus.state) and DEPLOYED (status.lifecycleState).
> It reaches RUNNING and STABLE again, but then for some reason it goes back
> to FINISHED and UPGRADING, and I noticed that the newly created jobmanager
> pod gets deleted and then recreated. rh-flinkdeployment-01 basically
> becomes stuck in this loop: it becomes stable and then gets re-deployed by
> the operator.
>
> This doesn't happen to all 110 FlinkDeployments, but it happens to around
> 30 of them concurrently.
>
> I have pasted some logs from one of the operator pods on one of the
> FlinkDeployments below, and I have highlighted the messages that seem
> suspicious to me. I will try to gather more logs and send them tomorrow.
>
> For now, to mitigate this, I had to delete all of these FlinkDeployments
> and run them with the deprecated GoogleCloudPlatform operator. I'm hoping
> to resolve this soon so that I don't have to run anything on the
> GoogleCloudPlatform operator anymore.
>
> Thanks!
> Tony
>
> 2023-11-02 05:26:02,132 i.j.o.p.e.ReconciliationDispatcher [ERROR][<namespace>/<flinkdeployment>] Error during event processing ExecutionScope{ resource id: ResourceID{name='<flinkdeployment>', namespace='<namespace>'}, version: 17772349729} failed.
> org.apache.flink.kubernetes.operator.exception.ReconciliationException: org.apache.flink.kubernetes.operator.exception.StatusConflictException: Status have been modified externally in version 17772349851 Previous: <REDACTED>
> ...
> 2023-11-02 05:27:25,945 o.a.f.k.o.o.d.ApplicationObserver [WARN ][<namespace>/<flinkdeployment>] *Running deployment generation -1 doesn't match upgrade target generation 2.*
> 2023-11-02 05:27:25,946 o.a.f.c.Configuration [WARN ][<namespace>/<flinkdeployment>] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
> 2023-11-02 05:27:26,034 o.a.f.k.o.l.AuditUtils [INFO ][<namespace>/<flinkdeployment>] >>> Status | Info | UPGRADING | The resource is being upgraded
> 2023-11-02 05:27:26,057 o.a.f.k.o.l.AuditUtils [INFO ][<namespace>/<flinkdeployment>] >>> Event | Info | SUBMIT | Starting deployment
> 2023-11-02 05:27:26,057 o.a.f.k.o.s.AbstractFlinkService [INFO ][<namespace>/<flinkdeployment>] Deploying application cluster requiring last-state from HA metadata
> 2023-11-02 05:27:26,057 o.a.f.c.Configuration [WARN ][<namespace>/<flinkdeployment>] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
> 2023-11-02 05:27:26,084 o.a.f.c.Configuration [WARN ][<namespace>/<flinkdeployment>] Config uses deprecated configuration key 'high-availability' instead of proper key 'high-availability.type'
> 2023-11-02 05:27:26,110 o.a.f.k.o.s.NativeFlinkService [INFO ][<namespace>/<flinkdeployment>] Deploying application cluster
> 2023-11-02 05:27:26,110 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO ][<namespace>/<flinkdeployment>] Submitting application in 'Application Mode'.
> 2023-11-02 05:27:26,112 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][<namespace>/<flinkdeployment>] The derived from fraction jvm overhead memory (1.000gb (1073741840 bytes)) is greater than its max value 1024.000mb (1073741824 bytes), max value will be used instead
> 2023-11-02 05:27:26,112 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][<namespace>/<flinkdeployment>] The derived from fraction jvm overhead memory (1.000gb (1073741840 bytes)) is greater than its max value 1024.000mb (1073741824 bytes), max value will be used instead
> 2023-11-02 05:27:26,163 o.a.f.k.o.s.AbstractFlinkService [INFO ][<namespace>/<flinkdeployment>] Waiting for cluster shutdown... (30s)
> 2023-11-02 05:27:26,193 o.a.f.k.o.l.AuditUtils [INFO ][<namespace>/<flinkdeployment>] >>> Event | Warning | *CLUSTERDEPLOYMENTEXCEPTION | The Flink cluster <flinkdeployment> already exists.*
> 2023-11-02 05:27:26,193 o.a.f.k.o.r.ReconciliationUtils [WARN ][<namespace>/<flinkdeployment>] Attempt count: 0, last attempt: false
> 2023-11-02 05:27:26,277 o.a.f.k.o.l.AuditUtils [INFO ][<namespace>/<flinkdeployment>] *>>> Status | Error | UPGRADING | {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException: The Flink cluster <flinkdeployment> already exists.","additionalMetadata":{},"throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"The Flink cluster <flinkdeployment> already exists.","additionalMetadata":{}}]}*
>
> --
>
> <http://www.robinhood.com/>
>
> Tony Chen
>
> Software Engineer
>
> Menlo Park, CA
>
> Don't copy, share, or use this email without permission. If you received
> it by accident, please let us know and then delete it right away.
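The leader-election settings asked about above live under the kubernetes.operator.leader-election.* prefix on the linked configuration page. A minimal sketch of enabling leader election so that only one operator replica reconciles at a time, with key names taken from that page and timing values that are illustrative assumptions rather than tuned recommendations:

```yaml
# Operator configuration (e.g. the operator's flink-conf.yaml, or the
# defaultConfiguration section of the operator Helm chart values).
# Key names per the flink-kubernetes-operator configuration docs;
# the durations below are illustrative assumptions.
kubernetes.operator.leader-election.enabled: true
# A lease name unique within the namespace is required once enabled.
kubernetes.operator.leader-election.lease-name: flink-operator-lease
# How long a lease is held before standby pods may try to acquire it.
kubernetes.operator.leader-election.lease-duration: 15s
# The leader must renew within this deadline or it gives up leadership.
kubernetes.operator.leader-election.renew-deadline: 10s
# How long contenders wait between acquire/renew attempts.
kubernetes.operator.leader-election.retry-period: 2s
```

With leader election enabled, standby pods only take over after the lease expires, so increasing lease-duration mainly slows failover; the key point is that only the current lease holder reconciles at all.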