[ https://issues.apache.org/jira/browse/FLINK-33481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783866#comment-17783866 ]
hansonhe commented on FLINK-33481:
----------------------------------

[~trohrmann] or anybody: can you help me answer my questions? This JobManager failover was caused by ZooKeeper cluster server failures, and I found that some Flink jobs consumed Kafka data from the earliest offset after the JobManager failed over.

> Why were checkpoints stored on ZooKeeper deleted when the JobManager failed with Flink High Availability on YARN
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33481
>                 URL: https://issues.apache.org/jira/browse/FLINK-33481
>             Project: Flink
>          Issue Type: Bug
>            Reporter: hansonhe
>            Priority: Major
>         Attachments: image-2023-11-08-09-40-59-889.png, image-2023-11-08-09-57-17-739.png, image-2023-11-08-10-05-54-694.png
>
> Flink version: 1.13.5
> (1) flink-conf.yaml (a fuller configuration sketch follows this report):
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.quorum: xxxxx
> state.checkpoint-storage: filesystem
> state.checkpoints.dir: hdfs://xxxxx
> (2) JobManager attempts for application_1684323088373_1744:
> jm_1: appattempt_1684323088373_1744_000001, Tue Oct 31 11:19:07 +0800 2023
> jm_2: appattempt_1684323088373_1744_000002, Sat Nov 4 11:10:52 +0800 2023
> (3) When appattempt_1684323088373_1744_000001 failed, I found:
> 3.1) Completed checkpoint 5750 for job 6262e8c6a072027459f9b4eeb3e9735c was successfully stored on HDFS.
> 3.2) The checkpoint stored in ZooKeeper under /flink/application_1684323088373_1744 was deleted.
> The logs are as follows:
> !image-2023-11-08-10-05-54-694.png!
> !image-2023-11-08-09-40-59-889.png!
> (4) After appattempt_1684323088373_1744_000001 failed, the JobManager switched to appattempt_1684323088373_1744_000002, whose logs begin with "No checkpoint found during restore":
> !image-2023-11-08-09-57-17-739.png!
> (5) My questions:
> 5.1) Why were the checkpoints stored on ZooKeeper deleted when the JobManager failed with Flink High Availability on YARN? This caused the JobManager to restore without finding any checkpoint.
> 5.2) Why doesn't failover to jm_2 (appattempt_1684323088373_1744_000002) restore directly from the successfully completed checkpoint 5750 stored on HDFS? Instead it still attempts to recover from ZooKeeperStateHandleStore first. (A hypothetical manual-recovery sketch follows this report.)
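For context on (1): a minimal sketch of what a complete ZooKeeper HA section of flink-conf.yaml typically looks like on Flink 1.13. The high-availability and high-availability.storageDir keys are assumptions (they are required for ZooKeeper HA but are not among the four keys quoted in the report); the xxxxx values stay as redacted placeholders.

{code:yaml}
# Sketch of a Flink 1.13 ZooKeeper HA configuration; xxxxx values are redacted.
high-availability: zookeeper                        # assumed: required to enable ZooKeeper HA
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.quorum: xxxxx
high-availability.storageDir: hdfs://xxxxx/flink/ha # assumed: required; stores the state handles that ZooKeeper nodes point to
state.checkpoint-storage: filesystem
state.checkpoints.dir: hdfs://xxxxx
{code}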
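Regarding question 5.2: as a hypothetical manual-recovery sketch (not a confirmed fix), a resubmitted job can be pointed explicitly at a retained checkpoint on HDFS via the execution.savepoint.path option, bypassing whatever is (or is not) left in ZooKeeper. The path below follows Flink's <state.checkpoints.dir>/<job-id>/chk-<n> layout, using the job id and checkpoint number from (3.1); the xxxxx segment stays redacted.

{code:yaml}
# Hypothetical: restore explicitly from completed checkpoint 5750 on HDFS
# instead of relying on ZooKeeperStateHandleStore during failover.
execution.savepoint.path: hdfs://xxxxx/6262e8c6a072027459f9b4eeb3e9735c/chk-5750
{code}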