[ https://issues.apache.org/jira/browse/FLINK-33481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
hansonhe updated FLINK-33481:
-----------------------------
    Affects Version/s: 1.13.1

> Why were checkpoints stored on zookeeper deleted when JobManager failures with Flink High Availability on yarn
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33481
>                 URL: https://issues.apache.org/jira/browse/FLINK-33481
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.13.1
>            Reporter: hansonhe
>            Priority: Major
>         Attachments: image-2023-11-08-09-40-59-889.png, image-2023-11-08-09-57-17-739.png, image-2023-11-08-10-05-54-694.png
>
>
> FlinkVersion: 1.13.1
> (1) flink-conf.yaml
> high-availability.zookeeper.path.root /flink
> high-availability.zookeeper.quorum xxxxx
> state.checkpoint-storage filesystem
> state.checkpoints.dir hdfs://xxxxx
> (2) jobmanager
> application_1684323088373_1744
> jm_1: appattempt_1684323088373_1744_000001 Tue Oct 31 11:19:07 +0800 2023
> jm_2: appattempt_1684323088373_1744_000002 Sat Nov 4 11:10:52 +0800 2023
> (3) When appattempt_1684323088373_1744_000001 failed, I found:
> 3.1) Checkpoint 5750 for job 6262e8c6a072027459f9b4eeb3e9735c was completed and stored on HDFS successfully.
> 3.2) The checkpoint stored in ZooKeeper under /flink/application_1684323088373_1744 was deleted.
> The logs are as follows:
> !image-2023-11-08-10-05-54-694.png!
> !image-2023-11-08-09-40-59-889.png!
> (4) After appattempt_1684323088373_1744_000001 failed, the JobManager switched over and started appattempt_1684323088373_1744_000002. Its logs begin as follows:
> No checkpoint found during restore
> !image-2023-11-08-09-57-17-739.png!
> (5) My questions:
> 5.1) Why were the checkpoints stored in ZooKeeper deleted when the JobManager failed with Flink High Availability on YARN? As a result, the new JobManager attempt tried to restore and found no checkpoint.
> 5.2) Why not directly use the successfully completed checkpoint 5750 stored on HDFS to restore when failing over to jm_2: appattempt_1684323088373_1744_000002?
> Instead, it still attempts to recover from ZookeeperStateHandleStore first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)