Urs Schoenenberger created FLINK-38290:
------------------------------------------
             Summary: Application cluster: FINISHED FlinkDeployment falls back to RECONCILING if JM restarts
                 Key: FLINK-38290
                 URL: https://issues.apache.org/jira/browse/FLINK-38290
             Project: Flink
          Issue Type: Bug
          Components: Client / Job Submission, Deployment / Kubernetes
    Affects Versions: kubernetes-operator-1.12.1, 1.20.1
            Reporter: Urs Schoenenberger

Hi folks, we are encountering the following issue, and I believe it's a bug or a missing feature.

Steps to reproduce:
* Deploy the example FlinkDeployment ([https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.12/examples/basic.yaml]) with a bounded job (e.g. examples/streaming/WordCount.jar), and configure high-availability.type: "kubernetes" and a high-availability.storageDir.
* Wait for the FlinkDeployment to reach FINISHED.
* Kill the JobManager pod.

Observed behaviour:
* A new JobManager is started.
* The new pod checks the HA dir and recognizes that the job has already completed. Log from StandaloneDispatcher: "Ignoring JobGraph submission (...) because the job already reached a globally-terminal state (...)".
* The operator tries to reconcile the job. In JobStatusObserver, it queries the JobManager's REST API (/jobs/overview) but receives a "not found".
** This is because the backend here does not consult the HA store but the JobStore, which is backed by RAM or a local file and is therefore not recovered on JM restart.
* This leads the Kubernetes operator to believe something is wrong with the FlinkDeployment, so the FlinkDeployment falls back to state RECONCILING and gets stuck there. Among other things, this interferes with monitoring and alerting.

Am I missing a particular piece of configuration to make this work?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
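
For reference, the HA settings from the steps to reproduce would be applied to the example manifest roughly as sketched below. This is an illustrative fragment only: the metadata name, image/resource values, and the storageDir path are assumptions, not taken from the original report or from basic.yaml.

```yaml
# Sketch of the reproduction setup described above.
# Assumed/illustrative values: metadata.name, image, resources,
# and the high-availability.storageDir path.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: ha-finished-repro        # assumed name
spec:
  image: flink:1.20              # illustrative image tag
  flinkVersion: v1_20
  serviceAccount: flink
  flinkConfiguration:
    # The two settings named in the steps to reproduce:
    high-availability.type: "kubernetes"
    high-availability.storageDir: "s3://my-bucket/flink-ha"   # assumed path
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    # Bounded example job, as in the steps to reproduce
    jarURI: local:///opt/flink/examples/streaming/WordCount.jar
    upgradeMode: stateless
```

After the deployment reaches FINISHED, deleting the JobManager pod (e.g. via kubectl delete pod) should trigger the observed fallback to RECONCILING.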