[ 
https://issues.apache.org/jira/browse/FLINK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Castelli updated FLINK-37320:
----------------------------------
    Attachment: operator-log-finite-streaming-job.log

> FINISHED jobs incorrectly being set to RECONCILING
> --------------------------------------------------
>
>                 Key: FLINK-37320
>                 URL: https://issues.apache.org/jira/browse/FLINK-37320
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.10.0
>         Environment: I've attached the FlinkDeployment CR and operator-config 
> I used to reproduce this locally.
>            Reporter: Luca Castelli
>            Priority: Minor
>         Attachments: operator-config.yaml, operator-log-batch-job.log, 
> operator-log-finite-streaming-job.log, test-batch-job.yaml
>
>
> Hello,
> I believe I've found bugs within the observation logic for both finite 
> streaming and batch jobs. This is a follow-up to: 
> [https://lists.apache.org/thread/xvsk4fmlqln092cdolvox4dgko0pw81k].
> *For finite streaming jobs:*
>  # The job finishes successfully and the job status changes to FINISHED
>  # TTL (kubernetes.operator.jm-deployment.shutdown-ttl) cleanup removes the 
> JM deployments and clears HA configmap data
>  # On the next loop, the observer sees MISSING JM and changes the job status 
> from FINISHED to RECONCILING
> The job had reached a terminal state. It shouldn't have been set back to 
> RECONCILING.
> This leads to an operator error later when a recovery attempt is triggered. 
> The recovery is triggered because the JM is MISSING, the status is 
> RECONCILING, the spec shows RUNNING, and HA is enabled. The recovery fails 
> because the HA metadata was already cleared by the TTL cleanup, so 
> validateHaMetadataExists throws UpgradeFailureException.
> At that point the deployment gets stuck in a loop with status RECONCILING and 
> UpgradeFailureException thrown on each cycle. I've attached operator logs 
> showing this.
> I think the fix would be to wrap 
> [https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L155]
>  in an if-statement that checks that the job is not in a terminal state. 
> Happy to discuss and/or put up the two-line PR.
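
The transition described above, and the proposed guard, can be sketched in isolation as follows. This is a minimal model and not the operator's code: the enums, the class name TerminalStateGuardSketch and the methods observeUnguarded and observeGuarded are hypothetical stand-ins invented for illustration. The actual change would presumably rely on the terminal-state check that Flink's own JobStatus already provides rather than on a hand-written set.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, self-contained model of the observer decision discussed in
// the report. None of these types are the operator's real classes.
public class TerminalStateGuardSketch {

    /** Simplified stand-in for the job states the operator tracks. */
    enum JobState { RUNNING, FINISHED, FAILED, CANCELED, RECONCILING }

    /** Simplified stand-in for the observed JobManager deployment status. */
    enum JmStatus { READY, MISSING }

    // Assumption: these three states are treated as terminal in this sketch.
    private static final Set<JobState> TERMINAL =
            EnumSet.of(JobState.FINISHED, JobState.FAILED, JobState.CANCELED);

    static boolean isTerminal(JobState state) {
        return TERMINAL.contains(state);
    }

    /** Reported behaviour: a MISSING JM always resets the status to RECONCILING. */
    static JobState observeUnguarded(JobState current, JmStatus jm) {
        return jm == JmStatus.MISSING ? JobState.RECONCILING : current;
    }

    /** Proposed behaviour: only reset the status if the job is not terminal. */
    static JobState observeGuarded(JobState current, JmStatus jm) {
        if (jm == JmStatus.MISSING && !isTerminal(current)) {
            return JobState.RECONCILING;
        }
        return current;
    }

    public static void main(String[] args) {
        // After TTL cleanup the JM is MISSING while the job is already FINISHED.
        System.out.println(observeUnguarded(JobState.FINISHED, JmStatus.MISSING));
        System.out.println(observeGuarded(JobState.FINISHED, JmStatus.MISSING));
        // A non-terminal job with a MISSING JM should still be reconciled.
        System.out.println(observeGuarded(JobState.RUNNING, JmStatus.MISSING));
    }
}
```

With the guard in place, a FINISHED job keeps its status after the TTL cleanup, so in this model the recovery path that needs HA metadata is never entered, while a RUNNING job whose JM disappears is still moved to RECONCILING.
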
> *For batch jobs:*
>  # Batch jobs don't use checkpointing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
