[ https://issues.apache.org/jira/browse/FLINK-33481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
hansonhe updated FLINK-33481:
-----------------------------
Description:
Flink version: 1.13.5

(1) flink-conf.yaml
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.quorum: xxxxx
state.checkpoint-storage: filesystem
state.checkpoints.dir: hdfs://xxxxx

(2) JobManager attempts for application_1684323088373_1744
appattempt_1684323088373_1744_000001  Tue Oct 31 11:19:07 +0800 2023
appattempt_1684323088373_1744_000002  Sat Nov 4 11:10:52 +0800 2023

(3) When appattempt_1684323088373_1744_000001 failed, I found:
3.1) Completed checkpoint 5750 for job 6262e8c6a072027459f9b4eeb3e9735c was successfully stored on HDFS.
3.2) The checkpoint entry stored in ZooKeeper under /flink/application_1684323088373_1744 was deleted.
The logs are as follows:
!image-2023-11-08-10-05-54-694.png!
!image-2023-11-08-09-40-59-889.png!

(4) After appattempt_1684323088373_1744_000001 failed, the JobManager switched to appattempt_1684323088373_1744_000002, whose logs start with "No checkpoint found during restore":
!image-2023-11-08-09-57-17-739.png!

(5) My questions:
5.1) Why were the checkpoints stored in ZooKeeper deleted when the JobManager failed with Flink high availability on YARN? This caused the JobManager to restore without finding any checkpoint.
5.2) Why was the successful, completed checkpoint 5750 stored on HDFS not used for the restore when JobManager appattempt_1684323088373_1744_000002 started?
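For reference, the settings listed in (1) correspond to a flink-conf.yaml fragment like the sketch below. The xxxxx values are placeholders kept from the report; the `high-availability` and `high-availability.storageDir` entries (including the example path) are assumptions added here, since ZooKeeper HA in Flink 1.13 requires them but they are not shown in the report:

```yaml
# HA via ZooKeeper (Flink 1.13); xxxxx are placeholders from the report
high-availability: zookeeper                    # assumed: needed to enable ZooKeeper HA
high-availability.zookeeper.quorum: xxxxx
high-availability.zookeeper.path.root: /flink
high-availability.storageDir: hdfs://xxxxx/ha   # assumed path: ZooKeeper keeps only pointers; HA payload lives here

# Checkpoint storage
state.checkpoint-storage: filesystem
state.checkpoints.dir: hdfs://xxxxx
```

With this layout, ZooKeeper holds only references to checkpoint metadata; the actual checkpoint data sits on HDFS, which is why a deleted ZooKeeper node can leave a completed checkpoint orphaned on HDFS.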
> Why were checkpoints stored on ZooKeeper deleted when the JobManager fails
> with Flink High Availability on YARN
> --------------------------------------------------------------------------
>
>         Key: FLINK-33481
>         URL: https://issues.apache.org/jira/browse/FLINK-33481
>     Project: Flink
>  Issue Type: Bug
>    Reporter: hansonhe
>    Priority: Major
> Attachments: image-2023-11-08-09-40-59-889.png,
> image-2023-11-08-09-57-17-739.png, image-2023-11-08-10-05-54-694.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)