[ https://issues.apache.org/jira/browse/FLINK-29620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17618683#comment-17618683 ]
liad shachoach commented on FLINK-29620: ---------------------------------------- i'll try it tomorrow, thanks :) > Flink deployment stuck in UPGRADING state when changing configuration > --------------------------------------------------------------------- > > Key: FLINK-29620 > URL: https://issues.apache.org/jira/browse/FLINK-29620 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator > Affects Versions: 1.14.2 > Environment: AWS EKS v1.21 > Operator version: 1.1.0 > Reporter: liad shachoach > Priority: Major > > When I update the configuration of a flink deployment I observe one of two > scenarios: > Success: > This happens when the job has not started - if I change the configuration > quick enough: > {code:java} > 2022-10-13 06:50:54,336 o.a.f.k.o.r.d.AbstractJobReconciler [INFO > ][load-streaming/validator-process-124] Upgrading/Restarting running job, > suspending first... > 2022-10-13 06:50:54,343 o.a.f.k.o.r.d.ApplicationReconciler [INFO > ][load-streaming/validator-process-124] Job is not running but HA metadata is > available for last state restore, ready for upgrade > 2022-10-13 06:50:54,353 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Deleting JobManager deployment while > preserving HA metadata. > 2022-10-13 06:50:58,415 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s) > 2022-10-13 06:51:03,451 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s) > 2022-10-13 06:51:06,469 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Cluster shutdown completed. > 2022-10-13 06:51:06,470 o.a.f.k.o.c.FlinkDeploymentController [INFO > ][load-streaming/validator-process-124] End of reconciliation > 2022-10-13 06:51:06,493 o.a.f.k.o.c.FlinkDeploymentController [INFO > ][load-streaming/validator-process-124] Starting reconciliation > 2022-10-13 06:51:06,494 o.a.f.k.o.c.FlinkConfigManager [INFO > ][load-streaming/validator-process-124] Generating new config > {code} > In this scenario I see that the job manager and task manager pods are > terminated and then recreated. > > > Failure: > This happens when I let the job start (wait more than 30-60 seconds) and > change the configuration: > {code:java} > 2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO > ][load-streaming/validator-process-124] Upgrading/Restarting running job, > suspending first... > 2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO > ][load-streaming/validator-process-124] Job is in running state, ready for > upgrade with SAVEPOINT > 2022-10-13 06:53:06,659 o.a.f.k.o.s.FlinkService [INFO > ][load-streaming/validator-process-124] Suspending job with savepoint. > 2022-10-13 06:53:07,042 o.a.f.k.o.s.FlinkService [INFO > ][load-streaming/validator-process-124] Job successfully suspended with > savepoint > s3://cu-flink-load-checkpoints-us-east-1/validator-process-124/savepoints/savepoint-000000-947975b509b2. > 2022-10-13 06:53:11,111 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s) > 2022-10-13 06:53:16,176 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s) > 2022-10-13 06:53:21,238 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (15s) > 2022-10-13 06:53:26,293 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (20s) > 2022-10-13 06:53:31,355 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (25s) > 2022-10-13 06:53:36,412 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (30s) > 2022-10-13 06:53:41,512 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (35s) > 2022-10-13 06:53:46,568 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (40s) > 2022-10-13 06:53:51,625 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (45s) > 2022-10-13 06:53:56,740 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (50s) > 2022-10-13 06:54:01,811 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (55s) > 2022-10-13 06:54:06,866 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Waiting for cluster shutdown... (60s) > 2022-10-13 06:54:07,866 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Cluster shutdown completed. > 2022-10-13 06:54:07,866 o.a.f.k.o.c.FlinkDeploymentController [INFO > ][load-streaming/validator-process-124] End of reconciliation > 2022-10-13 06:54:07,894 o.a.f.k.o.c.FlinkDeploymentController [INFO > ][load-streaming/validator-process-124] Starting reconciliation > 2022-10-13 06:54:07,894 o.a.f.k.o.o.d.ApplicationObserver [WARN > ][load-streaming/validator-process-124] Running deployment generation 3 > doesn't match upgrade target generation 4. > 2022-10-13 06:54:07,895 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO > ][load-streaming/validator-process-124] Detected spec change, starting > reconciliation. > 2022-10-13 06:54:07,941 o.a.f.k.o.s.FlinkService [INFO > ][load-streaming/validator-process-124] Deploying application cluster > 2022-10-13 06:54:07,947 o.a.f.k.o.u.FlinkUtils [INFO > ][load-streaming/validator-process-124] Job graph in ConfigMap > validator-process-124-dispatcher-leader is deleted > 2022-10-13 06:54:08,029 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO > ][load-streaming/validator-process-124] Submitting application in > 'Application Mode'. > 2022-10-13 06:54:08,031 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO > ][load-streaming/validator-process-124] The derived from fraction jvm > overhead memory (102.400mb (107374184 bytes)) is less than its min value > 192.000mb (201326592 bytes), min value will be used instead > 2022-10-13 06:54:08,087 o.a.f.k.o.r.ReconciliationUtils [WARN > ][load-streaming/validator-process-124] Attempt count: 0, last attempt: false > 2022-10-13 06:54:08,111 i.j.o.p.e.ReconciliationDispatcher > [ERROR][load-streaming/validator-process-124] Error during event processing > ExecutionScope{ resource id: ResourceID{name='validator-process-124', > namespace='load-streaming'}, version: 1116792084} failed. > org.apache.flink.kubernetes.operator.exception.ReconciliationException: > org.apache.flink.client.deployment.ClusterDeploymentException: The Flink > cluster validator-process-124 already exists. > at > org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:119) > at > org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54) > at > io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:201) > at > io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:153) > at > org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:83) > at > io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:152) > at > io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:135) > at > io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:115) > at > io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:86) > at > io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:59) > at > io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:390) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at java.base/java.lang.Thread.run(Unknown Source) > Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: The > Flink cluster validator-process-124 already exists. > at > org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:181) > at > org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67) > at > org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:200) > at > org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:155) > at > org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:52) > at > org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:188) > at > org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:122) > at > org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:145) > at > org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:55) > at > org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:115) > ... 13 more {code} > In this scenario I see that the job manager pod is restarted (not recreated), > task manager pods are not updated, flink config maps are not updated. > The flink deployment state changes to UPGRADING and the above exception is > repeated. > error in flink deployment: > org.apache.flink.client.deployment.ClusterDeploymentException: The Flink > cluster validator-process-124 already exists. > Job Manager Deployment Status: MISSING > > Flink deployment spec: > > {code:java} > flinkVersion: v1_14 > job: > allowNonRestoredState: true > args: ... > entryClass: ... > jarURI: ... > parallelism: x > savepointTriggerNonce: 0 > state: running > upgradeMode: savepoint > jobManager: > podTemplate: > apiVersion: v1 > kind: Pod > metadata: > annotations: > configmap.reloader.stakater.com/reload: > flink-config-validator-process-124,pod-template-validator-process-124 > spec: > affinity: > nodeAffinity: > requiredDuringSchedulingIgnoredDuringExecution: > nodeSelectorTerms: > - matchExpressions: > - key: nodeType > operator: In > values: > - someValue > containers: > - name: flink-main-container > resources: > limits: > cpu: "1" > memory: 1.6Gi > requests: > cpu: "0.2" > memory: 1Gi > tolerations: > - effect: NoSchedule > key: someValue > value: "true" > replicas: 1 > podTemplate: > apiVersion: v1 > kind: Pod > metadata: > annotations: > configmap.reloader.stakater.com/reload: > flink-config-validator-process-124,pod-template-validator-process-124 > prometheus.io/path: /metrics > prometheus.io/port: "9260" > prometheus.io/scrape: "true" > labels: > app.kubernetes.io/instance: flink-validator-process-124 > app.kubernetes.io/managed-by: Helm > app.kubernetes.io/name: apache-flink > app.kubernetes.io/version: test > helm.sh/chart: apache-flink-1.0.0 > spec: > containers: [] > imagePullSecrets: [] > serviceAccount: validator-process-124 > taskManager: > podTemplate: > apiVersion: v1 > kind: Pod > metadata: > annotations: > configmap.reloader.stakater.com/reload: > flink-config-validator-process-124,pod-template-validator-process-124 > spec: > affinity: > nodeAffinity: > requiredDuringSchedulingIgnoredDuringExecution: > nodeSelectorTerms: > - matchExpressions: > - key: nodeType > operator: In > values: > - someValue > containers: > - name: flink-main-container > resources: > limits: > cpu: "1" > memory: 3.6Gi > requests: > cpu: "0.2" > memory: 3Gi > tolerations: > - effect: NoSchedule > key: someValue > value: "true"{code} > > Please let me know if more details are required. > -- This message was sent by Atlassian Jira (v8.20.10#820010)