Re: Incorrect checkpoint id used when job is recovering

yuxia Thu, 19 May 2022 18:35:53 -0700

There's a similar issue, FLINK-19816 [1].

[1] https://issues.apache.org/jira/browse/FLINK-19816


Best regards, 
Yuxia 


From: "tao xiao" <xiaotao...@gmail.com> 
To: "User" <user@flink.apache.org> 
Sent: Thursday, May 19, 2022 9:16:34 PM 
Subject: Re: Incorrect checkpoint id used when job is recovering 

Hi team, 

Can anyone shed some light? 

On Sat, May 14, 2022 at 8:56 AM tao xiao <xiaotao...@gmail.com> wrote: 



Hi team, 

Does anyone have any ideas? 

On Thu, May 12, 2022 at 9:20 PM tao xiao <xiaotao...@gmail.com> wrote: 

BQ_BEGIN

Forgot to mention: the Flink version is 1.13.2 and we use Kubernetes native mode. 

On Thu, May 12, 2022 at 9:18 PM tao xiao <xiaotao...@gmail.com> wrote: 

BQ_BEGIN

Hi team, 
I ran into a weird issue when a job tried to recover from a JM failure. The 
last successful checkpoint before the JM crashed was 41205: 

``` 
{"log":"2022-05-10 14:55:40,663 INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
checkpoint 41205 for job 00000000000000000000000000000000 (9453840 bytes in 
1922 ms).\n","stream":"stdout","time":"2022-05-10T14:55:40.663286893Z"} 
``` 
However, the JM tries to recover the job from an older checkpoint, 41051, which 
doesn't exist, and that leaves the job in an unrecoverable state: 

``` 
"2022-05-10 14:59:38,949 INFO  
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying 
to retrieve checkpoint 41051.\n" 
``` 
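
To spot this kind of mismatch in a long JM log, a small script can compare the highest "Completed checkpoint" id with the ids the JM later tries to retrieve. This is only a rough sketch: the function name `compare_checkpoints` and the regular expressions are my own assumptions based on the two log messages quoted above, not anything provided by Flink.

```python
import re

# Patterns derived from the two log messages quoted in this thread.
# \s+ tolerates the line wrapping seen in the pasted log excerpts.
COMPLETED = re.compile(r"Completed\s+checkpoint\s+(\d+)\s+for\s+job")
RETRIEVE = re.compile(r"Trying\s+to\s+retrieve\s+checkpoint\s+(\d+)")


def compare_checkpoints(log_text):
    """Return (latest completed id, retrieved ids, retrieved ids older than latest)."""
    completed = [int(n) for n in COMPLETED.findall(log_text)]
    retrieved = [int(n) for n in RETRIEVE.findall(log_text)]
    latest = max(completed) if completed else None
    # A retrieved id below the latest completed id suggests a stale pointer.
    stale = [n for n in retrieved if latest is not None and n < latest]
    return latest, retrieved, stale


# Sample input built from the two log lines quoted above.
sample = (
    "2022-05-10 14:55:40,663 INFO  "
    "org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - "
    "Completed checkpoint 41205 for job 00000000000000000000000000000000 "
    "(9453840 bytes in 1922 ms).\n"
    "2022-05-10 14:59:38,949 INFO  "
    "org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - "
    "Trying to retrieve checkpoint 41051.\n"
)

latest, retrieved, stale = compare_checkpoints(sample)
print(f"latest completed: {latest}, retrieved: {retrieved}, stale: {stale}")
```

Run against the full log, a non-empty `stale` list would indicate that recovery is reading an older checkpoint id than the last one the coordinator reported as completed.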

Full log attached 

-- 
Regards, 
Tao 





-- 
Regards, 
Tao 

BQ_END



-- 
Regards, 
Tao 

BQ_END



-- 
Regards, 
Tao
