liad shachoach created FLINK-29620:
--------------------------------------

             Summary: Flink deployment stuck in UPGRADING state when changing 
configuration
                 Key: FLINK-29620
                 URL: https://issues.apache.org/jira/browse/FLINK-29620
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.14.2
         Environment: AWS EKS v1.21

Operator version: 1.1.0
            Reporter: liad shachoach


When I update the configuration of a flink deployment I observe one of two 
scenarios:

Success:

This happens when the job has not started - if I change the configuration quick 
enough:
{code:java}
2022-10-13 06:50:54,336 o.a.f.k.o.r.d.AbstractJobReconciler [INFO 
][load-streaming/validator-process-124] Upgrading/Restarting running job, 
suspending first...
2022-10-13 06:50:54,343 o.a.f.k.o.r.d.ApplicationReconciler [INFO 
][load-streaming/validator-process-124] Job is not running but HA metadata is 
available for last state restore, ready for upgrade
2022-10-13 06:50:54,353 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Deleting JobManager deployment while 
preserving HA metadata.
2022-10-13 06:50:58,415 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
2022-10-13 06:51:03,451 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
2022-10-13 06:51:06,469 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Cluster shutdown completed.
2022-10-13 06:51:06,470 o.a.f.k.o.c.FlinkDeploymentController [INFO 
][load-streaming/validator-process-124] End of reconciliation
2022-10-13 06:51:06,493 o.a.f.k.o.c.FlinkDeploymentController [INFO 
][load-streaming/validator-process-124] Starting reconciliation
2022-10-13 06:51:06,494 o.a.f.k.o.c.FlinkConfigManager [INFO 
][load-streaming/validator-process-124] Generating new config
 {code}
In this scenario I see that the job manager and task manager pods are 
terminated and then recreated.

 

 

Failure:

This happens when I let the job start (wait more than 30-60 seconds) and change 
the configuration:
{code:java}
2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO 
][load-streaming/validator-process-124] Upgrading/Restarting running job, 
suspending first...
2022-10-13 06:53:06,637 o.a.f.k.o.r.d.AbstractJobReconciler [INFO 
][load-streaming/validator-process-124] Job is in running state, ready for 
upgrade with SAVEPOINT
2022-10-13 06:53:06,659 o.a.f.k.o.s.FlinkService       [INFO 
][load-streaming/validator-process-124] Suspending job with savepoint.
2022-10-13 06:53:07,042 o.a.f.k.o.s.FlinkService       [INFO 
][load-streaming/validator-process-124] Job successfully suspended with 
savepoint 
s3://cu-flink-load-checkpoints-us-east-1/validator-process-124/savepoints/savepoint-000000-947975b509b2.
2022-10-13 06:53:11,111 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (5s)
2022-10-13 06:53:16,176 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (10s)
2022-10-13 06:53:21,238 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (15s)
2022-10-13 06:53:26,293 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (20s)
2022-10-13 06:53:31,355 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (25s)
2022-10-13 06:53:36,412 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (30s)
2022-10-13 06:53:41,512 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (35s)
2022-10-13 06:53:46,568 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (40s)
2022-10-13 06:53:51,625 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (45s)
2022-10-13 06:53:56,740 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (50s)
2022-10-13 06:54:01,811 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (55s)
2022-10-13 06:54:06,866 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Waiting for cluster shutdown... (60s)
2022-10-13 06:54:07,866 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Cluster shutdown completed.
2022-10-13 06:54:07,866 o.a.f.k.o.c.FlinkDeploymentController [INFO 
][load-streaming/validator-process-124] End of reconciliation
2022-10-13 06:54:07,894 o.a.f.k.o.c.FlinkDeploymentController [INFO 
][load-streaming/validator-process-124] Starting reconciliation
2022-10-13 06:54:07,894 o.a.f.k.o.o.d.ApplicationObserver [WARN 
][load-streaming/validator-process-124] Running deployment generation 3 doesn't 
match upgrade target generation 4.
2022-10-13 06:54:07,895 o.a.f.k.o.r.d.AbstractFlinkResourceReconciler [INFO 
][load-streaming/validator-process-124] Detected spec change, starting 
reconciliation.
2022-10-13 06:54:07,941 o.a.f.k.o.s.FlinkService       [INFO 
][load-streaming/validator-process-124] Deploying application cluster
2022-10-13 06:54:07,947 o.a.f.k.o.u.FlinkUtils         [INFO 
][load-streaming/validator-process-124] Job graph in ConfigMap 
validator-process-124-dispatcher-leader is deleted
2022-10-13 06:54:08,029 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO 
][load-streaming/validator-process-124] Submitting application in 'Application 
Mode'.
2022-10-13 06:54:08,031 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO 
][load-streaming/validator-process-124] The derived from fraction jvm overhead 
memory (102.400mb (107374184 bytes)) is less than its min value 192.000mb 
(201326592 bytes), min value will be used instead
2022-10-13 06:54:08,087 o.a.f.k.o.r.ReconciliationUtils [WARN 
][load-streaming/validator-process-124] Attempt count: 0, last attempt: false
2022-10-13 06:54:08,111 i.j.o.p.e.ReconciliationDispatcher 
[ERROR][load-streaming/validator-process-124] Error during event processing 
ExecutionScope{ resource id: ResourceID{name='validator-process-124', 
namespace='load-streaming'}, version: 1116792084} failed.
org.apache.flink.kubernetes.operator.exception.ReconciliationException: 
org.apache.flink.client.deployment.ClusterDeploymentException: The Flink 
cluster validator-process-124 already exists.
    at 
org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:119)
    at 
org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:54)
    at 
io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:201)
    at 
io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:153)
    at 
org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics.timeControllerExecution(OperatorJosdkMetrics.java:83)
    at 
io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:152)
    at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:135)
    at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:115)
    at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:86)
    at 
io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:59)
    at 
io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:390)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: The 
Flink cluster validator-process-124 already exists.
    at 
org.apache.flink.kubernetes.KubernetesClusterDescriptor.deployApplicationCluster(KubernetesClusterDescriptor.java:181)
    at 
org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:67)
    at 
org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:200)
    at 
org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:155)
    at 
org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deploy(ApplicationReconciler.java:52)
    at 
org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.restoreJob(AbstractJobReconciler.java:188)
    at 
org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractJobReconciler.reconcileSpecChange(AbstractJobReconciler.java:122)
    at 
org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:145)
    at 
org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler.reconcile(AbstractFlinkResourceReconciler.java:55)
    at 
org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:115)
    ... 13 more {code}
In this scenario I see that the job manager pod is restarted (not recreated), 
task manager pods are not updated, flink config maps are not updated.

The flink deployment state changes to UPGRADING and the above exception is 
repeated.

 

Flink deployment spec:

 
{code:java}
flinkVersion: v1_14 
job:
    allowNonRestoredState: true
    args: ...
    entryClass: ...
    jarURI: ...
    parallelism: x
    savepointTriggerNonce: 0
    state: running
    upgradeMode: savepoint
jobManager:
    podTemplate:
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          configmap.reloader.stakater.com/reload: 
flink-config-validator-process-124,pod-template-validator-process-124
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: nodeType
                  operator: In
                  values:
                  - someValue
        containers:
        - name: flink-main-container
          resources:
            limits:
              cpu: "1"
              memory: 1.6Gi
            requests:
              cpu: "0.2"
              memory: 1Gi
        tolerations:
        - effect: NoSchedule
          key: someValue
          value: "true"
    replicas: 1
podTemplate:
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        configmap.reloader.stakater.com/reload: 
flink-config-validator-process-124,pod-template-validator-process-124
        prometheus.io/path: /metrics
        prometheus.io/port: "9260"
        prometheus.io/scrape: "true"
      labels:
        app.kubernetes.io/instance: flink-validator-process-124
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: apache-flink
        app.kubernetes.io/version: test
        helm.sh/chart: apache-flink-1.0.0
    spec:
      containers: []
      imagePullSecrets: []
  serviceAccount: validator-process-124
  taskManager:
    podTemplate:
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          configmap.reloader.stakater.com/reload: 
flink-config-validator-process-124,pod-template-validator-process-124
      spec:
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: nodeType
                  operator: In
                  values:
                  - someValue
        containers:
        - name: flink-main-container
          resources:
            limits:
              cpu: "1"
              memory: 3.6Gi
            requests:
              cpu: "0.2"
              memory: 3Gi
        tolerations:
        - effect: NoSchedule
          key: someValue
          value: "true"{code}
 

Please let me know if more details are required.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to