[ https://issues.apache.org/jira/browse/FLINK-33481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783866#comment-17783866 ]
hansonhe commented on FLINK-33481:
----------------------------------

[~trohrmann] or anybody: can you help me answer my questions? This JobManager failover was caused by ZooKeeper cluster server failures, and I found that some Flink jobs consumed Kafka data from the earliest offset after the JobManager failed over.

> Why were checkpoints stored on ZooKeeper deleted when the JobManager failed with Flink High Availability on YARN
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33481
>                 URL: https://issues.apache.org/jira/browse/FLINK-33481
>             Project: Flink
>          Issue Type: Bug
>            Reporter: hansonhe
>            Priority: Major
>         Attachments: image-2023-11-08-09-40-59-889.png, image-2023-11-08-09-57-17-739.png, image-2023-11-08-10-05-54-694.png
>
> Flink version: 1.13.5
> (1) flink-conf.yaml (a fuller configuration sketch follows this report):
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.quorum: xxxxx
> state.checkpoint-storage: filesystem
> state.checkpoints.dir: hdfs://xxxxx
> (2) JobManager attempts for application_1684323088373_1744:
> jm_1: appattempt_1684323088373_1744_000001, Tue Oct 31 11:19:07 +0800 2023
> jm_2: appattempt_1684323088373_1744_000002, Sat Nov 4 11:10:52 +0800 2023
> (3) When appattempt_1684323088373_1744_000001 failed, I found:
> 3.1) Completed checkpoint 5750 for job 6262e8c6a072027459f9b4eeb3e9735c was successfully stored on HDFS.
> 3.2) The checkpoint stored in ZooKeeper under /flink/application_1684323088373_1744 was deleted.
> The logs are as follows:
> !image-2023-11-08-10-05-54-694.png!
> !image-2023-11-08-09-40-59-889.png!
> (4) After appattempt_1684323088373_1744_000001 failed, the JobManager switched to appattempt_1684323088373_1744_000002, whose logs begin with "No checkpoint found during restore":
> !image-2023-11-08-09-57-17-739.png!
> (5) My questions:
> 5.1) Why were the checkpoints stored on ZooKeeper deleted when the JobManager failed with Flink High Availability on YARN? This caused the JobManager to restore without finding any checkpoint.
> 5.2) Why doesn't failover to jm_2 (appattempt_1684323088373_1744_000002) restore directly from the successfully completed checkpoint 5750 stored on HDFS? Instead it still attempts to recover from ZooKeeperStateHandleStore first. (A hypothetical manual-recovery sketch follows this report.)
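For context on (1): a minimal sketch of what a complete ZooKeeper HA section of flink-conf.yaml typically looks like on Flink 1.13. The high-availability and high-availability.storageDir keys are assumptions (they are required for ZooKeeper HA but are not among the four keys quoted in the report); the xxxxx values stay as redacted placeholders.

{code:yaml}
# Sketch of a Flink 1.13 ZooKeeper HA configuration; xxxxx values are redacted.
high-availability: zookeeper                        # assumed: required to enable ZooKeeper HA
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.quorum: xxxxx
high-availability.storageDir: hdfs://xxxxx/flink/ha # assumed: required; stores the state handles that ZooKeeper nodes point to
state.checkpoint-storage: filesystem
state.checkpoints.dir: hdfs://xxxxx
{code}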
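Regarding question 5.2: as a hypothetical manual-recovery sketch (not a confirmed fix), a resubmitted job can be pointed explicitly at a retained checkpoint on HDFS via the execution.savepoint.path option, bypassing whatever is (or is not) left in ZooKeeper. The path below follows Flink's <state.checkpoints.dir>/<job-id>/chk-<n> layout, using the job id and checkpoint number from (3.1); the xxxxx segment stays redacted.

{code:yaml}
# Hypothetical: restore explicitly from completed checkpoint 5750 on HDFS
# instead of relying on ZooKeeperStateHandleStore during failover.
execution.savepoint.path: hdfs://xxxxx/6262e8c6a072027459f9b4eeb3e9735c/chk-5750
{code}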