tweise commented on PR #356: URL: https://github.com/apache/flink-kubernetes-operator/pull/356#issuecomment-1230267838
> I think only checking the existence of stable spec is too dangerous. The job can start up, take 2 checkpoints and go in a failure loop without the operator ever seeing it "stable".
>
> Another problem with this approach is I think it introduces a bug with the `initialSavepointPath` logic. That field is only used in first deployment scenarios so with this change, it would be completely ignored the second time you try to fix your never running job.

Good catch. Looks like we are lacking test coverage for that.

> I fixed a similar issue in this commit: [ac21bc8](https://github.com/apache/flink-kubernetes-operator/commit/ac21bc8fe148f6dc803988791224f792a66875ce)
>
> I think a slightly different approach would be a bit better:
>
> 1. Observer somehow marks the status as initial failure. It has to confirm that the jobmanager never actually started somehow.
> 2. In Reconciler if we get an upgrade and have initial failure, delete deployment & use `ReconciliationUtils.clearLastReconciledSpecIfFirstDeploy(flinkApp);`, reschedule reconcile loop with 0 (like mid upgrade)

The reset process would need to be triggered by the spec change, not by observation of the pending deployment. What prompts the reset is that there is a spec change while the initial deployment has not been "reconciled". The pending deployment may or may not succeed; the operator cannot determine that (as in the situations that led to the discovery of this issue: image pull error and entry point failure).

We can probably tighten the last spec change by also checking for presence of HA metadata (`FlinkUtils.isKubernetesHAActivated(deployConfig) && FlinkUtils.isKubernetesHAActivated(observeConfig) && flinkService.isHaMetadataAvailable(deployConfig)`).
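
A rough sketch of the trigger condition described above, using plain booleans instead of the operator's real classes. `ResetDecision`, `haMetadataPresent`, and `shouldReset` are hypothetical names for illustration only; the three boolean inputs to `haMetadataPresent` stand in for the three calls in the expression above. Treating the presence of HA metadata as a reason *not* to reset is an assumption about how the check would be combined, not something settled in this thread.

```java
/** Illustrative only: hypothetical stand-in, not part of the operator code base. */
final class ResetDecision {

    /**
     * Mirrors the shape of the HA expression from the comment. Each argument
     * stands in for one of the three calls (HA activated in the deploy config,
     * HA activated in the observe config, HA metadata available).
     */
    static boolean haMetadataPresent(
            boolean haActivatedInDeployConfig,
            boolean haActivatedInObserveConfig,
            boolean haMetadataAvailable) {
        return haActivatedInDeployConfig
                && haActivatedInObserveConfig
                && haMetadataAvailable;
    }

    /**
     * The reset is prompted by a spec change while the initial deployment has
     * not been reconciled. ASSUMPTION: if HA metadata exists, the job may have
     * written state, so the reset is skipped in that case.
     */
    static boolean shouldReset(
            boolean specChanged,
            boolean initialDeploymentReconciled,
            boolean haMetadataPresent) {
        return specChanged && !initialDeploymentReconciled && !haMetadataPresent;
    }
}
```

Under this reading, observing the pending deployment alone never triggers the reset: with no spec change, `shouldReset` is false regardless of the other inputs.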