[ https://issues.apache.org/jira/browse/FLINK-32774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maximilian Michels updated FLINK-32774:
---------------------------------------
    Fix Version/s: kubernetes-operator-1.6.0

> Reconciliation for autoscaling overrides gets stuck after cancel-with-savepoint
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-32774
>                 URL: https://issues.apache.org/jira/browse/FLINK-32774
>             Project: Flink
>          Issue Type: Bug
>          Components: Autoscaler, Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.6.0
>            Reporter: Maximilian Michels
>            Assignee: Maximilian Michels
>            Priority: Critical
>             Fix For: kubernetes-operator-1.6.0
>
>
> Since https://issues.apache.org/jira/browse/FLINK-32589 the operator no
> longer relies on the Flink configuration to store the parallelism overrides.
> Instead, it stores them internally in the autoscaler config map. When scaling
> without the rescaling API, the spec is changed on the fly during
> reconciliation and the parallelism overrides are added.
> Unfortunately, this leads to the cluster getting stuck with the job in
> FINISHED state after taking a savepoint for upgrade. The operator assumes
> that the new cluster got deployed successfully and goes into DEPLOYED state
> again.
> Log flow (from oldest to newest):
> # Rescheduling new reconciliation immediately to execute scaling operation.
> # Upgrading/Restarting running job, suspending first...
> # Job is in running state, ready for upgrade with SAVEPOINT
> # Suspending existing deployment.
> # Suspending job with savepoint.
> # Job successfully suspended with savepoint
> # The resource is being upgraded
> # Pending upgrade is already deployed, updating status.
> # Observing JobManager deployment. Previous status: DEPLOYING
> # JobManager deployment port is ready, waiting for the Flink REST API...
> # DEPLOYED The resource is deployed/submitted to Kubernetes, but it’s not
> yet considered to be stable and might be rolled back in the future
> It appears the issue might be in (8):
> [https://github.com/apache/flink-kubernetes-operator/blob/c09671c5c51277c266b8c45d493317d3be1324c0/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L260]
> because the generation id hasn't been changed by the mere parallelism
> override change.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
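
To illustrate the suspected failure mode in step (8), here is a minimal, self-contained sketch. It is not the operator's code: the class `GenerationCheckSketch`, the record `Spec`, and both check methods are hypothetical names invented for this example. It only shows why a check based on the generation alone reports "already deployed" when nothing but the parallelism overrides changed, and how also comparing the overrides would tell the two specs apart. Whether the actual fix takes this form is an assumption.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch; none of these names come from the operator code base. */
public class GenerationCheckSketch {

    /** Simplified stand-in for a spec: a generation id plus parallelism overrides. */
    record Spec(long generation, Map<String, Integer> parallelismOverrides) {}

    /** Generation-only check, modelling the behaviour suspected in step (8). */
    static boolean alreadyDeployedByGeneration(Spec deployed, Spec target) {
        return deployed.generation() == target.generation();
    }

    /** Stricter check that also compares the overrides added during reconciliation. */
    static boolean alreadyDeployedByGenerationAndOverrides(Spec deployed, Spec target) {
        return deployed.generation() == target.generation()
                && Objects.equals(deployed.parallelismOverrides(), target.parallelismOverrides());
    }

    public static void main(String[] args) {
        // Same generation, because only the in-memory overrides changed.
        Spec deployed = new Spec(5L, Map.of());
        Spec target = new Spec(5L, Map.of("vertex-1", 4));

        System.out.println("generation only: " + alreadyDeployedByGeneration(deployed, target));
        System.out.println("generation + overrides: "
                + alreadyDeployedByGenerationAndOverrides(deployed, target));
    }
}
```

In this sketch the generation-only check treats the pending upgrade as deployed even though the overrides differ, which matches the log line "Pending upgrade is already deployed, updating status."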