[ https://issues.apache.org/jira/browse/FLINK-33481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785434#comment-17785434 ]
Matthias Pohl commented on FLINK-33481:
---------------------------------------

[~hansonhe] Flink 1.13.1 reached end of life quite some time ago. I agree that the behavior looks suspicious, and your conclusion based on the logs you shared is correct. However, it is quite tedious to investigate whether later versions already contain a fix for this specific issue. Can you reproduce the issue? If so, could you run the same scenario on a newer Flink version (e.g. Flink 1.18) to check whether it is still reproducible there?

> Why were checkpoints stored in ZooKeeper deleted when the JobManager fails with Flink high availability on YARN
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33481
>                 URL: https://issues.apache.org/jira/browse/FLINK-33481
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.13.1
>            Reporter: hansonhe
>            Priority: Major
>         Attachments: image-2023-11-08-09-40-59-889.png, image-2023-11-08-09-57-17-739.png, image-2023-11-08-10-05-54-694.png
>
> Flink version: 1.13.1
>
> (1) flink-conf.yaml
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.quorum: xxxxx
> state.checkpoint-storage: filesystem
> state.checkpoints.dir: hdfs://xxxxx
>
> (2) JobManager attempts for application_1684323088373_1744
> jm_1: appattempt_1684323088373_1744_000001, started Tue Oct 31 11:19:07 +0800 2023
> jm_2: appattempt_1684323088373_1744_000002, started Sat Nov 4 11:10:52 +0800 2023
>
> (3) When appattempt_1684323088373_1744_000001 failed, I observed:
> 3.1) Completed checkpoint 5750 for job 6262e8c6a072027459f9b4eeb3e9735c was successfully stored on HDFS.
> 3.2) The checkpoint metadata stored in ZooKeeper under /flink/application_1684323088373_1744 was deleted.
> The logs are as follows:
> !image-2023-11-08-10-05-54-694.png!
> !image-2023-11-08-09-40-59-889.png!
>
> (4) After appattempt_1684323088373_1744_000001 failed, the JobManager failed over to appattempt_1684323088373_1744_000002, whose startup log reports "No checkpoint found during restore":
> !image-2023-11-08-09-57-17-739.png!
>
> (5) My questions:
> 5.1) Why were the checkpoints stored in ZooKeeper deleted when the JobManager failed with Flink high availability on YARN? As a result, the JobManager restored without finding any checkpoint.
> 5.2) Why is the successfully completed checkpoint 5750 stored on HDFS not used directly for the restore after failing over to jm_2 (appattempt_1684323088373_1744_000002)? Instead, the JobManager still tries to recover from the ZooKeeperStateHandleStore first.
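To support the reproduction requested above, the following is a minimal sketch (not part of the original report) that lists the completed-checkpoint handles Flink keeps in ZooKeeper. It assumes the default Flink 1.13 layout <high-availability.zookeeper.path.root>/<cluster-id>/checkpoints/<job-id>; the quorum address, application id, and job id below are illustrative placeholders taken from the report and would need to be adjusted.

{code:java}
// Sketch only, under the assumptions stated above: inspect the checkpoint
// handles Flink stores in ZooKeeper for a given job.
// Requires org.apache.curator:curator-framework on the classpath.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.List;

public class ListZkCheckpointHandles {

    public static void main(String[] args) throws Exception {
        // Values from the report; adjust to your environment.
        String quorum = "xxxxx:2181"; // high-availability.zookeeper.quorum
        String checkpointsPath =
                "/flink/application_1684323088373_1744"                     // path.root + cluster-id
                        + "/checkpoints/6262e8c6a072027459f9b4eeb3e9735c";  // job id

        CuratorFramework client =
                CuratorFrameworkFactory.newClient(quorum, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            if (client.checkExists().forPath(checkpointsPath) == null) {
                System.out.println("No checkpoint node found under " + checkpointsPath);
                return;
            }
            // Each child node is a state handle pointing to a completed checkpoint on HDFS.
            List<String> handles = client.getChildren().forPath(checkpointsPath);
            System.out.println("Checkpoint handles: " + handles);
        } finally {
            client.close();
        }
    }
}
{code}

Running this before and after killing the first JobManager attempt should show whether the handles (and their parent node) really disappear while the corresponding checkpoint data is still present on HDFS, which is what the screenshots in (3) suggest.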