Luca Castelli created FLINK-37320:
-------------------------------------
Summary: FINISHED jobs incorrectly being set to RECONCILING
Key: FLINK-37320
URL: https://issues.apache.org/jira/browse/FLINK-37320
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.10.0
Environment: I've attached the flinkdeployment CR and operator-config
that I used to reproduce this locally.
Reporter: Luca Castelli
Attachments:
flink-kubernetes-operator-deploy-flink-kubernetes-operator-6f97d96777-8k2d4-1739457686217038000.log,
operator-config.yaml, test-batch-job.yaml
Hello,
I believe we've found a bug in the observation logic for finite streaming
or batch jobs. This is a follow-up to [this dev mailing list
post](https://lists.apache.org/thread/xvsk4fmlqln092cdolvox4dgko0pw81k).
The sequence of events is:
# The job finishes successfully and the job status changes to FINISHED
# The TTL cleanup (kubernetes.operator.jm-deployment.shutdown-ttl) removes the JM
deployments and clears the HA configmap data
# On the next reconciliation loop, the observer sees that the JM is MISSING and
changes the job status from FINISHED to RECONCILING
The job had reached a terminal state. It shouldn't have been set back to
RECONCILING.
This leads to an operator error later, when a recovery attempt is triggered. The
recovery is triggered because the JM is MISSING, the status is RECONCILING, the
spec shows RUNNING, and HA is enabled. The recovery fails because
validateHaMetadataExists throws UpgradeFailureException, since the HA metadata
was already cleared by the TTL cleanup.
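
To make the recovery condition concrete, here is a minimal sketch of the four checks listed above. All types and method names (JmState, JobState, SpecState, RecoverySketch) are hypothetical stand-ins and not the operator's real classes. IllegalStateException stands in for UpgradeFailureException.

```java
// Hypothetical, simplified model; these are NOT the operator's real types.
enum JmState { READY, MISSING }
enum JobState { RUNNING, FINISHED, RECONCILING }
enum SpecState { RUNNING, SUSPENDED }

final class RecoverySketch {
    // Recovery is attempted only when all four conditions from the report hold.
    static boolean shouldAttemptRecovery(
            JmState jm, JobState status, SpecState spec, boolean haEnabled) {
        return jm == JmState.MISSING
                && status == JobState.RECONCILING
                && spec == SpecState.RUNNING
                && haEnabled;
    }

    // Stand-in for the HA metadata validation: fails once the metadata is gone.
    static void recover(boolean haMetadataPresent) {
        if (!haMetadataPresent) {
            throw new IllegalStateException(
                    "HA metadata not available (stand-in for UpgradeFailureException)");
        }
    }
}
```

In this model a job whose status stays FINISHED never satisfies shouldAttemptRecovery, so the failing validation is never reached.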
At that point the deployment gets stuck in a RECONCILING loop with
UpgradeFailureException thrown on each cycle. I've attached operator logs
showing this.
I think the fix would be to wrap
[this](https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L155)
in an if-statement that checks the job is not in a terminal state.
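
A rough sketch of the guard I have in mind, written against a simplified model rather than the real observer code. The enums and the ObserverGuardSketch class are hypothetical. The set of terminal states is an assumption modeled on Flink's globally terminal job states (FINISHED, FAILED, CANCELED).

```java
// Hypothetical, simplified model of the proposed guard; not the operator's actual code.
enum JobStatus {
    RUNNING(false), FINISHED(true), FAILED(true), CANCELED(true), RECONCILING(false);

    private final boolean terminal;

    JobStatus(boolean terminal) {
        this.terminal = terminal;
    }

    boolean isTerminal() {
        return terminal;
    }
}

enum JmDeploymentStatus { READY, MISSING }

final class ObserverGuardSketch {
    // Behaviour described in this report: a MISSING JM always yields RECONCILING.
    static JobStatus observeUnguarded(JobStatus current, JmDeploymentStatus jm) {
        return jm == JmDeploymentStatus.MISSING ? JobStatus.RECONCILING : current;
    }

    // Proposed behaviour: a job that already reached a terminal state keeps it.
    static JobStatus observeGuarded(JobStatus current, JmDeploymentStatus jm) {
        if (jm == JmDeploymentStatus.MISSING && !current.isTerminal()) {
            return JobStatus.RECONCILING;
        }
        return current;
    }
}
```

With the guard, a FINISHED job stays FINISHED after the TTL cleanup removes the JM, while a RUNNING job whose JM goes missing still moves to RECONCILING as before.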