[ https://issues.apache.org/jira/browse/FLINK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Castelli updated FLINK-37320:
----------------------------------
    Summary: [Observer] FINISHED finite streaming jobs incorrectly being set to RECONCILING  (was: FINISHED finite streaming jobs incorrectly being set to RECONCILING)

> [Observer] FINISHED finite streaming jobs incorrectly being set to RECONCILING
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-37320
>                 URL: https://issues.apache.org/jira/browse/FLINK-37320
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.10.0
>         Environment: I've attached the flinkdeployment CR and operator-config
> I used to locally replicate.
>            Reporter: Luca Castelli
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: operator-config.yaml,
> operator-log-finite-streaming-job.log, test-finite-streaming-job.yaml
>
>
> Hello,
> I believe I've found a bug within the observation logic for finite streaming
> jobs. This is a follow-up to:
> [https://lists.apache.org/thread/xvsk4fmlqln092cdolvox4dgko0pw81k].
>
> *For finite streaming jobs:*
> # The job finishes successfully and the job status changes to FINISHED
> # TTL (kubernetes.operator.jm-deployment.shutdown-ttl) cleanup removes the
> JM deployments and clears HA configmap data
> # On the next loop, the observer sees MISSING JM and changes the job status
> from FINISHED to RECONCILING
>
> The job had reached a terminal state. It shouldn't have been set back to
> RECONCILING.
>
> This leads to an operator error later when a recovery attempt is triggered.
> The recovery is triggered because the JM is MISSING, the status is
> RECONCILING, spec shows RUNNING, and HA enabled. The recovery fails with
> validateHaMetadataExists throwing UpgradeFailureException.
>
> At that point the deployment gets stuck in a loop with status RECONCILING and
> UpgradeFailureException thrown on each cycle. I've attached operator logs
> showing this.
>
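
The status transition described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the operator's code: `JobState`, `JmState`, `observeUnguarded` and `observeGuarded` are hypothetical names invented for this example, and only FINISHED is treated as terminal. It contrasts an observation step that always reacts to a MISSING JM with one that first checks for a terminal state, which is the kind of check proposed for this issue.

```java
import java.util.EnumSet;
import java.util.Set;

public class TerminalStateGuardSketch {

    /** Hypothetical, simplified stand-in for the job states named in this report. */
    enum JobState { RUNNING, FINISHED, RECONCILING }

    /** Hypothetical stand-in for what the observer sees of the JM deployment. */
    enum JmState { READY, MISSING }

    // Assumption: only FINISHED counts as terminal in this toy model.
    static final Set<JobState> TERMINAL = EnumSet.of(JobState.FINISHED);

    /** Reported behaviour: a MISSING JM resets the status regardless of the current state. */
    static JobState observeUnguarded(JobState current, JmState jm) {
        return jm == JmState.MISSING ? JobState.RECONCILING : current;
    }

    /** Same step, but skipped entirely when the job is already in a terminal state. */
    static JobState observeGuarded(JobState current, JmState jm) {
        if (TERMINAL.contains(current)) {
            return current; // a terminal job is left untouched
        }
        return observeUnguarded(current, jm);
    }

    public static void main(String[] args) {
        // After TTL cleanup the JM is MISSING while the job is already FINISHED.
        System.out.println(observeUnguarded(JobState.FINISHED, JmState.MISSING));
        System.out.println(observeGuarded(JobState.FINISHED, JmState.MISSING));
        // A non-terminal job with a MISSING JM still moves to RECONCILING.
        System.out.println(observeGuarded(JobState.RUNNING, JmState.MISSING));
    }
}
```

In this model the guarded variant keeps a FINISHED job at FINISHED, so the later recovery attempt (and the UpgradeFailureException loop) would never be entered, while non-terminal jobs behave as before.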
> *Proposed solution:* I think the fix would be to wrap
> [AbstractFlinkDeploymentObserver.observeJmDeployment|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L155]
> in an if-statement that checks the job is not in a terminal state. Happy to
> discuss and/or put up the 2 line code change PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)