[ 
https://issues.apache.org/jira/browse/FLINK-33222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773606#comment-17773606
 ] 

Nicolas Fraison commented on FLINK-33222:
-----------------------------------------

So I was wrong: the 1.7-snapshot release is not affected by this bug, thanks to the 
[https://github.com/apache/flink-kubernetes-operator/pull/681] patch.

Indeed, when deploying the app with an {{initialSavepointPath}}:
 * lastReconciledSpec gets its generation updated from N to N+1 while the stable 
spec generation stays at N. No rollback is detected, however, because the 
[update|https://github.com/apache/flink-kubernetes-operator/pull/681/files#diff-29ea38a50cac5b4432dd0969bc3e2177e29a5507f8c7bb01b80f605a8740de41R169]
 is done after the 
[rollback|https://github.com/apache/flink-kubernetes-operator/pull/681/files#diff-29ea38a50cac5b4432dd0969bc3e2177e29a5507f8c7bb01b80f605a8740de41R146]
 check, so the deployment is considered DEPLOYED

 * then, on the second reconcile loop, the stable spec generation is also updated 
from N to N+1 (in 
[patchAndCacheStatus|https://github.com/apache/flink-kubernetes-operator/blob/release-1.6/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/controller/FlinkDeploymentController.java#L135]) and
 the deployment is considered STABLE

But this looks quite brittle to me, as just moving the shouldRollBack check or 
the ReconciliationUtils.updateReconciliationMetadata call could bring that bad 
behaviour back.
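
A minimal sketch of that ordering dependency. The class, field, and method names below are hypothetical, simplified stand-ins rather than the operator's actual code, and stability is reduced to a comparison of two generation numbers:

```java
final class ReconcileOrderSketch {

    /** Hypothetical, simplified stand-in for the operator's reconciliation status. */
    static final class Status {
        long lastReconciledGeneration;
        long lastStableGeneration;

        Status(long generation) {
            this.lastReconciledGeneration = generation;
            this.lastStableGeneration = generation;
        }

        boolean isLastReconciledSpecStable() {
            return lastReconciledGeneration == lastStableGeneration;
        }
    }

    /** Simplified rollback condition: unstable spec and readiness timeout already elapsed. */
    static boolean shouldRollBack(Status status, boolean readinessTimeoutElapsed) {
        return !status.isLastReconciledSpecStable() && readinessTimeoutElapsed;
    }

    /** Order described in the comment: rollback check first, metadata update last. */
    static boolean reconcileCheckThenUpdate(Status status, long newGeneration, boolean timeoutElapsed) {
        boolean rollback = shouldRollBack(status, timeoutElapsed);
        status.lastReconciledGeneration = newGeneration;
        return rollback;
    }

    /** Hypothetical reordering: metadata update first, rollback check afterwards. */
    static boolean reconcileUpdateThenCheck(Status status, long newGeneration, boolean timeoutElapsed) {
        status.lastReconciledGeneration = newGeneration;
        return shouldRollBack(status, timeoutElapsed);
    }

    /** Stand-in for the status update that marks the reconciled spec as stable. */
    static void markStable(Status status) {
        status.lastStableGeneration = status.lastReconciledGeneration;
    }

    public static void main(String[] args) {
        Status safe = new Status(1);
        System.out.println("check-then-update rolls back: "
                + reconcileCheckThenUpdate(safe, 2, true));

        Status brittle = new Status(1);
        System.out.println("update-then-check rolls back: "
                + reconcileUpdateThenCheck(brittle, 2, true));
    }
}
```

With the check first, the generation bump from 1 to 2 goes unnoticed in that loop (prints false); with the update first, the same bump triggers a rollback (prints true).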

 

I'm wondering whether we could take the generation field into account in 
[isLastReconciledSpecStable|https://github.com/apache/flink-kubernetes-operator/blob/release-1.6/flink-kubernetes-operator-api/src/main/java/org/apache/flink/kubernetes/operator/api/status/ReconciliationStatus.java#L91].
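
One possible reading of that idea, as a sketch only: compare the reconciled and stable specs on their content and tolerate a difference in generation. The Snapshot type and both method names are hypothetical and do not reflect how ReconciliationStatus is actually implemented:

```java
import java.util.Objects;

final class StableSpecSketch {

    /** Hypothetical snapshot: serialized spec plus the metadata generation stored with it. */
    static final class Snapshot {
        final String spec;
        final long generation;

        Snapshot(String spec, long generation) {
            this.spec = spec;
            this.generation = generation;
        }
    }

    /** Strict comparison: any generation drift makes the spec look unstable. */
    static boolean isStableStrict(Snapshot lastReconciled, Snapshot lastStable) {
        return Objects.equals(lastReconciled.spec, lastStable.spec)
                && lastReconciled.generation == lastStable.generation;
    }

    /** Generation-tolerant comparison: only the spec content decides stability. */
    static boolean isStableIgnoringGeneration(Snapshot lastReconciled, Snapshot lastStable) {
        return Objects.equals(lastReconciled.spec, lastStable.spec);
    }

    public static void main(String[] args) {
        // Same spec content, generation bumped from 1 to 2 on the reconciled side only.
        Snapshot reconciled = new Snapshot("{\"job\":\"a\"}", 2);
        Snapshot stable = new Snapshot("{\"job\":\"a\"}", 1);

        System.out.println("strict: " + isStableStrict(reconciled, stable));
        System.out.println("ignoring generation: " + isStableIgnoringGeneration(reconciled, stable));
    }
}
```

Under this assumption, a generation-only change would no longer make the last reconciled spec look unstable, regardless of where the rollback check sits in the reconcile loop.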

> Operator rollback app when it should not
> ----------------------------------------
>
>                 Key: FLINK-33222
>                 URL: https://issues.apache.org/jira/browse/FLINK-33222
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>         Environment: Flink operator 1.6 - Flink 1.17.1
>            Reporter: Nicolas Fraison
>            Priority: Major
>
> The operator can decide to roll back when the job spec is updated on 
> savepointTriggerNonce or initialSavepointPath, if the app has been deployed 
> for longer than KubernetesOperatorConfigOptions.DEPLOYMENT_READINESS_TIMEOUT.
>  
> This is due to the ObjectMeta generation being 
> [updated|https://github.com/apache/flink-kubernetes-operator/blob/release-1.6/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L169]
>  when those spec fields change, which leaves lastReconciledSpec no longer 
> aligned with the stable spec, even though those fields are correctly ignored 
> when checking for an upgrade diff
>  
> Looking at the main branch, we should still face the same issue, as the same 
> [update|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/reconciler/deployment/AbstractFlinkResourceReconciler.java#L169]
>  is performed at the end of the reconcile loop



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
