[jira] [Updated] (FLINK-37320) FINISHED jobs incorrectly being set to RECONCILING

Luca Castelli (Jira) Thu, 13 Feb 2025 11:50:18 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-37320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Luca Castelli updated FLINK-37320:
----------------------------------
    Description: 
Hello,

I believe we've found a bug within the observation logic for finite streaming 
or batch jobs. This is a follow-up to: 
[https://lists.apache.org/thread/xvsk4fmlqln092cdolvox4dgko0pw81k].
 # The job finishes successfully and the job status changes to FINISHED
 # TTL (kubernetes.operator.jm-deployment.shutdown-ttl) cleanup removes the JM 
deployments and clears HA configmap data
 # On the next loop, the observer sees MISSING JM and changes the job status 
from FINISHED to RECONCILING

The job had reached a terminal state. It shouldn't have been set back to 
RECONCILING.

This leads to an operator error later when a recovery attempt is triggered. The 
recovery is triggered because the JM is MISSING, the status is RECONCILING, the 
spec shows RUNNING, and HA is enabled. The recovery fails with 
validateHaMetadataExists throwing UpgradeFailureException.

At that point the deployment gets stuck in a RECONCILING loop with 
UpgradeFailureException thrown on each cycle. I've attached operator logs 
showing this.

I think the fix would be to wrap 
[https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L155]
 in an if-statement that checks the job is not in a terminal state. Happy to 
discuss or put up a PR.
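
As a rough illustration of the guard described above, here is a minimal, self-contained sketch. It is not the operator's actual code: the class, the enums (JobState, JmStatus) and the method names are hypothetical stand-ins, and the set of terminal states (FINISHED, FAILED, CANCELED) is an assumption made for the example.

```java
// Hypothetical model of the observer step; none of these names come from
// the flink-kubernetes-operator code base.
public class TerminalStateGuardSketch {

    /** Simplified job states; the terminal set is an assumption for this sketch. */
    public enum JobState {
        RUNNING, RECONCILING, FINISHED, FAILED, CANCELED;

        public boolean isTerminal() {
            return this == FINISHED || this == FAILED || this == CANCELED;
        }
    }

    /** Simplified JobManager deployment status as seen by the observer. */
    public enum JmStatus { READY, MISSING }

    /** Behaviour described in the report: a MISSING JM always forces RECONCILING. */
    public static JobState observeUnguarded(JobState current, JmStatus jm) {
        if (jm == JmStatus.MISSING) {
            return JobState.RECONCILING;
        }
        return current;
    }

    /** Proposed behaviour: leave the state alone once the job is terminal. */
    public static JobState observeGuarded(JobState current, JmStatus jm) {
        if (jm == JmStatus.MISSING && !current.isTerminal()) {
            return JobState.RECONCILING;
        }
        return current;
    }

    public static void main(String[] args) {
        // After TTL cleanup the JM is MISSING while the job is already FINISHED.
        System.out.println(observeUnguarded(JobState.FINISHED, JmStatus.MISSING)); // RECONCILING
        System.out.println(observeGuarded(JobState.FINISHED, JmStatus.MISSING));   // FINISHED
        // A non-terminal job with a MISSING JM still goes to RECONCILING.
        System.out.println(observeGuarded(JobState.RUNNING, JmStatus.MISSING));    // RECONCILING
    }
}
```

With the guard in place, a FINISHED job stays FINISHED after the TTL cleanup, so in this model the recovery conditions listed above (status RECONCILING with a MISSING JM) are never met for a finished job.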

  was:
Hello,

I believe we've found a bug within the observation logic for finite streaming 
or batch jobs. This is a follow-up to: 
[https://lists.apache.org/thread/xvsk4fmlqln092cdolvox4dgko0pw81k].
 # The job finishes successfully and the job status changes to FINISHED
 # TTL (kubernetes.operator.jm-deployment.shutdown-ttl) cleanup removes the JM 
deployments and clears HA configmap data
 # On the next loop, the observer sees MISSING JM and changes the job status 
from FINISHED to RECONCILING

The job had reached a terminal state. It shouldn't have been set back to 
RECONCILING.

This leads to an operator error later when a recovery attempt is triggered. The 
recovery is triggered because the JM is MISSING, the status is RECONCILING, the 
spec shows RUNNING, and HA is enabled. The recovery fails because 
validateHaMetadataExists throws UpgradeFailureException.

At that point the deployment gets stuck in a RECONCILING loop with 
UpgradeFailureException thrown on each cycle. I've attached operator logs 
showing this.

I think the fix would be to wrap 
[https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L155]
 in an if-statement that checks the job is not in a terminal state. Happy to 
discuss or put up a PR.


> FINISHED jobs incorrectly being set to RECONCILING
> --------------------------------------------------
>
>                 Key: FLINK-37320
>                 URL: https://issues.apache.org/jira/browse/FLINK-37320
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.10.0
>         Environment: I've attached the FlinkDeployment CR and operator config 
> I used to reproduce the issue locally.
>            Reporter: Luca Castelli
>            Priority: Minor
>         Attachments: 
> flink-kubernetes-operator-deploy-flink-kubernetes-operator-6f97d96777-8k2d4-1739457686217038000.log,
>  operator-config.yaml, test-batch-job.yaml
>
>
> Hello,
> I believe we've found a bug within the observation logic for finite streaming 
> or batch jobs. This is a follow-up to: 
> [https://lists.apache.org/thread/xvsk4fmlqln092cdolvox4dgko0pw81k].
>  # The job finishes successfully and the job status changes to FINISHED
>  # TTL (kubernetes.operator.jm-deployment.shutdown-ttl) cleanup removes the 
> JM deployments and clears HA configmap data
>  # On the next loop, the observer sees MISSING JM and changes the job status 
> from FINISHED to RECONCILING
> The job had reached a terminal state. It shouldn't have been set back to 
> RECONCILING.
> This leads to an operator error later when a recovery attempt is triggered. 
> The recovery is triggered because the JM is MISSING, the status is 
> RECONCILING, the spec shows RUNNING, and HA is enabled. The recovery fails 
> with validateHaMetadataExists throwing UpgradeFailureException.
> At that point the deployment gets stuck in a RECONCILING loop with 
> UpgradeFailureException thrown on each cycle. I've attached operator logs 
> showing this.
>  
> I think the fix would be to wrap 
> [https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/observer/deployment/AbstractFlinkDeploymentObserver.java#L155]
>  in an if-statement that checks the job is not in a terminal state. Happy to 
> discuss or put up a PR.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
