[jira] [Updated] (FLINK-33481) Why were checkpoints stored on zookeeper deleted when JobManager failures with Flink High Availability on yarn

hansonhe (Jira) Wed, 08 Nov 2023 01:40:04 -0800


     [ https://issues.apache.org/jira/browse/FLINK-33481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


hansonhe updated FLINK-33481:
-----------------------------
    Affects Version/s: 1.13.1

> Why were checkpoints stored on zookeeper deleted when JobManager failures 
> with Flink High Availability on yarn
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33481
>                 URL: https://issues.apache.org/jira/browse/FLINK-33481
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.13.1
>            Reporter: hansonhe
>            Priority: Major
>         Attachments: image-2023-11-08-09-40-59-889.png, 
> image-2023-11-08-09-57-17-739.png, image-2023-11-08-10-05-54-694.png
>
>
> Flink version: 1.13.1
> (1) flink-conf.yaml
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.quorum: xxxxx
> state.checkpoint-storage: filesystem
> state.checkpoints.dir: hdfs://xxxxx
> (2) JobManager attempts
> application_1684323088373_1744
> jm_1: appattempt_1684323088373_1744_000001    Tue Oct 31 11:19:07 +0800 2023
> jm_2: appattempt_1684323088373_1744_000002    Sat Nov 4 11:10:52 +0800 2023
> (3) When appattempt_1684323088373_1744_000001 failed, I found:
>    3.1) Checkpoint 5750 for job 6262e8c6a072027459f9b4eeb3e9735c was
> completed and stored successfully on HDFS.
>    3.2) The checkpoint stored in ZooKeeper under
> /flink/application_1684323088373_1744 was deleted.
> The logs are as follows:
> !image-2023-11-08-10-05-54-694.png!
> !image-2023-11-08-09-40-59-889.png!
> (4) After appattempt_1684323088373_1744_000001 failed, the JobManager
> switched to appattempt_1684323088373_1744_000002, whose logs begin as follows:
> No checkpoint found during restore. !image-2023-11-08-09-57-17-739.png!
> (5) My questions:
>        5.1) Why were the checkpoints stored in ZooKeeper deleted when the
> JobManager failed with Flink High Availability on YARN? As a result, the new
> JobManager started its restore without finding any checkpoint.
>        5.2) Why not directly use the successfully completed checkpoint 5750
> stored on HDFS to restore when failing over to
> jm_2: appattempt_1684323088373_1744_000002? Instead, it still attempts to
> recover from ZooKeeperStateHandleStore first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
