[ https://issues.apache.org/jira/browse/FLINK-33481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
hansonhe updated FLINK-33481:
-----------------------------
Description:
Flink version: 1.13.5

(1) flink-conf.yaml
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.quorum: xxxxx
state.checkpoint-storage: filesystem
state.checkpoints.dir: hdfs://xxxxx

(2) JobManager attempts for application_1684323088373_1744
appattempt_1684323088373_1744_000001  Tue Oct 31 11:19:07 +0800 2023
appattempt_1684323088373_1744_000002  Sat Nov 4 11:10:52 +0800 2023

(3) When appattempt_1684323088373_1744_000001 failed, I found:
3.1) Completed checkpoint 5750 for job 6262e8c6a072027459f9b4eeb3e9735c was successfully stored on HDFS.
3.2) The checkpoint entry stored in ZooKeeper under /flink/application_1684323088373_1744 was deleted.
The logs are as follows:
!image-2023-11-08-10-05-54-694.png!
!image-2023-11-08-09-40-59-889.png!

(4) After appattempt_1684323088373_1744_000001 failed, the JobManager switched to appattempt_1684323088373_1744_000002, whose logs start with "No checkpoint found during restore":
!image-2023-11-08-09-57-17-739.png!

(5) My questions:
5.1) Why were the checkpoints stored in ZooKeeper deleted when the JobManager failed with Flink high availability on YARN? This caused the JobManager to restore without finding any checkpoint.
5.2) Why was the successful, completed checkpoint 5750 stored on HDFS not used for the restore when JobManager appattempt_1684323088373_1744_000002 started?
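For reference, the settings listed in (1) correspond to a flink-conf.yaml fragment like the sketch below. The xxxxx values are placeholders kept from the report; the `high-availability` and `high-availability.storageDir` entries (including the example path) are assumptions added here, since ZooKeeper HA in Flink 1.13 requires them but they are not shown in the report:

```yaml
# HA via ZooKeeper (Flink 1.13); xxxxx are placeholders from the report
high-availability: zookeeper                    # assumed: needed to enable ZooKeeper HA
high-availability.zookeeper.quorum: xxxxx
high-availability.zookeeper.path.root: /flink
high-availability.storageDir: hdfs://xxxxx/ha   # assumed path: ZooKeeper keeps only pointers; HA payload lives here

# Checkpoint storage
state.checkpoint-storage: filesystem
state.checkpoints.dir: hdfs://xxxxx
```

With this layout, ZooKeeper holds only references to checkpoint metadata; the actual checkpoint data sits on HDFS, which is why a deleted ZooKeeper node can leave a completed checkpoint orphaned on HDFS.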
> Why were checkpoints stored on ZooKeeper deleted when the JobManager fails
> with Flink High Availability on YARN
> --------------------------------------------------------------------------
>
>         Key: FLINK-33481
>         URL: https://issues.apache.org/jira/browse/FLINK-33481
>     Project: Flink
>  Issue Type: Bug
>    Reporter: hansonhe
>    Priority: Major
> Attachments: image-2023-11-08-09-40-59-889.png,
> image-2023-11-08-09-57-17-739.png, image-2023-11-08-10-05-54-694.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)