[ https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451072#comment-17451072 ]
Adrian Vasiliu edited comment on FLINK-25098 at 11/30/21, 11:36 AM:
--------------------------------------------------------------------

Yes, since then we also identified the K8S configmaps as being the cause. The scenario is:
1. Flink cluster deployed and receiving a Flink job. All good.
2. Uninstall - all K8S objects go away, except the Flink configmaps.
3. Reinstall => crashloopbackoff.

I hear now from colleagues that the issue with Flink CMs being left behind at uninstall time has already been raised, see https://lists.apache.org/thread/ml9dp9jqytnn303wypqoor7b32o1y32y.

Your take on it?


was (Author: JIRAUSER280892):
    Yes, since then we also identified the K8S configmaps as being the cause. The scenario is:
1. Flink cluster deployed and receiving a Flink job. All good.
2. Uninstall - all K8S objects go away, except the Flink configmaps.
3. Reinstall => crashloopbackoff.

I hear now from colleagues that the issue with Flink CMs being left behind at uninstall time has already been raised on the user list.

Your take on it?

> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
>                 Key: FLINK-25098
>                 URL: https://issues.apache.org/jira/browse/FLINK-25098
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.2, 1.13.3
>         Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>            Reporter: Adrian Vasiliu
>            Priority: Critical
>         Attachments: jm-flink-ha-jobmanager-log.txt,
> jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink
> 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to
> CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of
> jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
> * Persistent jobs storage provided by the {{rocks-cephfs}} storage class
> (shared by all replicas - ReadWriteMany) and mount path set via
> {{high-availability.storageDir: file///<dir>}}.
> * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not
> a "one-shot" trouble.
> Remarks:
> * This is a follow-up of
> https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
>
> * Picked Critical severity as HA is critical for our product.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
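
For the uninstall/reinstall scenario in the comment above, a minimal sketch of how the leftover HA ConfigMaps could be selected for cleanup before reinstalling. It assumes the ConfigMaps carry the labels described in Flink's Kubernetes HA documentation (app=<cluster-id> and configmap-type=high-availability); check this against your own deployment first. The helper names, the cluster id "my-flink-cluster" and the namespace "flink" are hypothetical. The script only prints the kubectl command and does not run it.

```python
import shlex
from typing import List


def ha_configmap_selector(cluster_id: str) -> str:
    """Build the label selector for the HA ConfigMaps of one Flink cluster.

    Assumption: the ConfigMaps are labeled app=<cluster-id> and
    configmap-type=high-availability, as in Flink's Kubernetes HA docs.
    """
    if not cluster_id:
        raise ValueError("cluster_id must be non-empty")
    return f"app={cluster_id},configmap-type=high-availability"


def cleanup_command(cluster_id: str, namespace: str) -> List[str]:
    """Return the kubectl argv that would delete the leftover HA ConfigMaps."""
    return [
        "kubectl", "delete", "configmap",
        "--namespace", namespace,
        "--selector", ha_configmap_selector(cluster_id),
    ]


if __name__ == "__main__":
    # Hypothetical cluster id and namespace; replace with your own values.
    print(shlex.join(cleanup_command("my-flink-cluster", "flink")))
```

Deleting these ConfigMaps also discards the HA metadata stored in them, so this only fits the case where the previous cluster's jobs do not need to be recovered.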