[ https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452441#comment-17452441 ]
Enrique Lacal commented on FLINK-25098:
---------------------------------------

Hi Till,

I want to differentiate between a reinstall and this occurring whilst the Flink cluster is running. I agree that to uninstall everything it is safer to manually delete the ConfigMaps, and that is what Adrian has seen above, with them pointing to an outdated job graph. Deleting the ConfigMaps solves that issue, but not the latter one.

The other problem we are seeing is that, whilst a Flink cluster is running, the ConfigMap for one of the jobs becomes inconsistent for some reason, and when the leader JM goes down and a new follower tries to restore that job using the ConfigMap, it cannot find the checkpoint referenced there. (This is what Neeraj has shared through the logs.)

I've done some investigation, and it seems that Flink updates the ConfigMap before creating the `completedCheckpoint` folder in the directory set by `high-availability.storageDir`; I watched the ConfigMap and the filesystem simultaneously. My assumption is that the leader JM goes down for some reason after the ConfigMap is updated but before the `completedCheckpoint` is written, and the state becomes inconsistent. Flink then cannot recover from this state without manual intervention, which is a significant problem. Another idea might be that the checkpoint fails but the ConfigMap is prematurely updated; I think that is less likely.

I understand you want to see some logs from when the checkpoint fails in order to understand the root cause, so we have set up persistence for our logs and will share them as soon as we can reproduce this issue. I also couldn't find a way of reproducing this by crashing the job, killing the leader pod, etc., and since the interval between the ConfigMap being updated and the file being created in the filesystem is so short, it's hard to crash the pod at that specific moment.

From my observation, once the Flink cluster is in this unrecoverable state, the actual checkpoint is stored in `state.checkpoints.dir` (as `chk-<number>`), but the `completedCheckpoint` doesn't exist, which means that the checkpoint has been taken correctly but the reference to it is missing. I believe this is the way it works, but I'm not 100% sure.

Just in case, these are the parameters used for checkpointing:
|Checkpointing Mode|Exactly Once|
|Checkpoint Storage|FileSystemCheckpointStorage|
|State Backend|EmbeddedRocksDBStateBackend|
|Interval|5s|
|Timeout|10m 0s|
|Minimum Pause Between Checkpoints|0ms|
|Maximum Concurrent Checkpoints|1|
|Unaligned Checkpoints|Disabled|
|Persist Checkpoints Externally|Enabled (retain on cancellation)|
|Tolerable Failed Checkpoints|0|

Do you think there is a workaround for this issue? Maybe changing the above configuration to be less strict?

Thanks for your time!
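In case it is useful, this is roughly how the parameters above map onto the DataStream API in Flink 1.13 (a minimal sketch, not our actual job: the pipeline, job name, and checkpoint path are placeholders):

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Interval 5s, exactly-once mode.
        env.enableCheckpointing(5_000L, CheckpointingMode.EXACTLY_ONCE);

        // RocksDB state backend; checkpoints go to a filesystem path
        // (placeholder path, not our real state.checkpoints.dir).
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        CheckpointConfig config = env.getCheckpointConfig();
        config.setCheckpointStorage("file:///flink/checkpoints");

        // Timeout 10m, no minimum pause, one concurrent checkpoint, aligned barriers.
        config.setCheckpointTimeout(10 * 60 * 1000L);
        config.setMinPauseBetweenCheckpoints(0L);
        config.setMaxConcurrentCheckpoints(1);
        config.enableUnalignedCheckpoints(false);

        // Retain externalized checkpoints on cancellation; no failed checkpoints tolerated.
        config.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        config.setTolerableCheckpointFailureNumber(0);

        // Placeholder pipeline standing in for the real job graph.
        env.fromElements(1, 2, 3).print();

        env.execute("checkpoint-setup-sketch");
    }
}
```

By "less strict" I mostly have the last two points in mind, i.e. the zero minimum pause between checkpoints and the zero tolerable failed checkpoints, if you think relaxing those would help.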
> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
>                 Key: FLINK-25098
>                 URL: https://issues.apache.org/jira/browse/FLINK-25098
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.2, 1.13.3
>         Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>            Reporter: Adrian Vasiliu
>            Priority: Critical
>         Attachments: iaf-insights-engine--7fc4-eve-29ee-ep-jobmanager-1-jobmanager.log, jm-flink-ha-jobmanager-log.txt, jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of the jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
> * Persistent jobs storage provided by the {{rocks-cephfs}} storage class (shared by all replicas - ReadWriteMany) and mount path set via {{high-availability.storageDir: file///<dir>}}.
> * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters, it's not a "one-shot" trouble.
> Remarks:
> * This is a follow-up of https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524.
> * Picked Critical severity as HA is critical for our product.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)