[ https://issues.apache.org/jira/browse/FLINK-33481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17785434#comment-17785434 ]
Matthias Pohl commented on FLINK-33481:
---------------------------------------

[~hansonhe] Flink 1.13.1 reached end of life quite some time ago. I agree that the behavior looks suspicious, and your conclusion based on the logs you shared is correct. However, it is quite tedious to investigate whether later versions already contain a fix for this specific issue. Can you reproduce the issue? If so, could you run the same scenario on a newer Flink version (e.g. Flink 1.18) to check whether it is still reproducible there?

> Why were checkpoints stored in ZooKeeper deleted when the JobManager fails with Flink high availability on YARN
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33481
>                 URL: https://issues.apache.org/jira/browse/FLINK-33481
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.13.1
>            Reporter: hansonhe
>            Priority: Major
>         Attachments: image-2023-11-08-09-40-59-889.png, image-2023-11-08-09-57-17-739.png, image-2023-11-08-10-05-54-694.png
>
> Flink version: 1.13.1
>
> (1) flink-conf.yaml
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.quorum: xxxxx
> state.checkpoint-storage: filesystem
> state.checkpoints.dir: hdfs://xxxxx
>
> (2) JobManager attempts for application_1684323088373_1744
> jm_1: appattempt_1684323088373_1744_000001, started Tue Oct 31 11:19:07 +0800 2023
> jm_2: appattempt_1684323088373_1744_000002, started Sat Nov 4 11:10:52 +0800 2023
>
> (3) When appattempt_1684323088373_1744_000001 failed, I observed:
> 3.1) Completed checkpoint 5750 for job 6262e8c6a072027459f9b4eeb3e9735c was successfully stored on HDFS.
> 3.2) The checkpoint metadata stored in ZooKeeper under /flink/application_1684323088373_1744 was deleted.
> The logs are as follows:
> !image-2023-11-08-10-05-54-694.png!
> !image-2023-11-08-09-40-59-889.png!
>
> (4) After appattempt_1684323088373_1744_000001 failed, the JobManager failed over to appattempt_1684323088373_1744_000002, whose startup log reports "No checkpoint found during restore":
> !image-2023-11-08-09-57-17-739.png!
>
> (5) My questions:
> 5.1) Why were the checkpoints stored in ZooKeeper deleted when the JobManager failed with Flink high availability on YARN? As a result, the JobManager restored without finding any checkpoint.
> 5.2) Why is the successfully completed checkpoint 5750 stored on HDFS not used directly for the restore after failing over to jm_2 (appattempt_1684323088373_1744_000002)? Instead, the JobManager still tries to recover from the ZooKeeperStateHandleStore first.
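To support the reproduction requested above, the following is a minimal sketch (not part of the original report) that lists the completed-checkpoint handles Flink keeps in ZooKeeper. It assumes the default Flink 1.13 layout <high-availability.zookeeper.path.root>/<cluster-id>/checkpoints/<job-id>; the quorum address, application id, and job id below are illustrative placeholders taken from the report and would need to be adjusted.

{code:java}
// Sketch only, under the assumptions stated above: inspect the checkpoint
// handles Flink stores in ZooKeeper for a given job.
// Requires org.apache.curator:curator-framework on the classpath.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.List;

public class ListZkCheckpointHandles {

    public static void main(String[] args) throws Exception {
        // Values from the report; adjust to your environment.
        String quorum = "xxxxx:2181"; // high-availability.zookeeper.quorum
        String checkpointsPath =
                "/flink/application_1684323088373_1744"                     // path.root + cluster-id
                        + "/checkpoints/6262e8c6a072027459f9b4eeb3e9735c";  // job id

        CuratorFramework client =
                CuratorFrameworkFactory.newClient(quorum, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            if (client.checkExists().forPath(checkpointsPath) == null) {
                System.out.println("No checkpoint node found under " + checkpointsPath);
                return;
            }
            // Each child node is a state handle pointing to a completed checkpoint on HDFS.
            List<String> handles = client.getChildren().forPath(checkpointsPath);
            System.out.println("Checkpoint handles: " + handles);
        } finally {
            client.close();
        }
    }
}
{code}

Running this before and after killing the first JobManager attempt should show whether the handles (and their parent node) really disappear while the corresponding checkpoint data is still present on HDFS, which is what the screenshots in (3) suggest.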