There's a similar issue: FLINK-19816 [1].

[1] https://issues.apache.org/jira/browse/FLINK-19816

Best regards,
Yuxia

From: "tao xiao" <xiaotao...@gmail.com>
To: "User" <user@flink.apache.org>
Sent: Thursday, May 19, 2022, 9:16:34 PM
Subject: Re: Incorrect checkpoint id used when job is recovering

Hi team,

Can anyone shed some light?

On Sat, May 14, 2022 at 8:56 AM tao xiao <xiaotao...@gmail.com> wrote:

Hi team,

Does anyone have any ideas?

On Thu, May 12, 2022 at 9:20 PM tao xiao <xiaotao...@gmail.com> wrote:

Forgot to mention: the Flink version is 1.13.2 and we use Kubernetes native mode.

On Thu, May 12, 2022 at 9:18 PM tao xiao <xiaotao...@gmail.com> wrote:

Hi team,

I ran into a weird issue when a job tries to recover from a JM failure. The last successful checkpoint before the JM crashed is 41205:

```
{"log":"2022-05-10 14:55:40,663 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 41205 for job 00000000000000000000000000000000 (9453840 bytes in 1922 ms).\n","stream":"stdout","time":"2022-05-10T14:55:40.663286893Z"}
```

However, the JM tries to recover the job from an older checkpoint, 41051, which doesn't exist, and that leaves the job in an unrecoverable state:

```
"2022-05-10 14:59:38,949 INFO org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 41051.\n"
```

Full log attached.

--
Regards,
Tao