Hi team,

Does anyone have any ideas?
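For anyone skimming the attached log, this is a small sketch of the check that surfaces the mismatch: it greps for the last "Completed checkpoint N" line versus the checkpoint id the restarted JM tries to retrieve. The helper name and the inlined sample lines (taken from the log excerpts quoted below) are just for illustration.

```python
import re

def checkpoint_mismatch(log_text: str):
    """Return (last_completed, last_retrieved) checkpoint ids found in a JM log."""
    completed = [int(n) for n in re.findall(r"Completed checkpoint (\d+)", log_text)]
    retrieved = [int(n) for n in re.findall(r"Trying to retrieve checkpoint (\d+)", log_text)]
    return (max(completed, default=None), max(retrieved, default=None))

# Sample lines reduced from the JM log quoted in this thread:
log = """
2022-05-10 14:55:40,663 INFO  CheckpointCoordinator [] - Completed checkpoint 41205 for job 00000000000000000000000000000000
2022-05-10 14:59:38,949 INFO  DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 41051.
"""
print(checkpoint_mismatch(log))  # → (41205, 41051)
```

If the second number is lower than the first, the HA store handed the JM a stale checkpoint pointer, which matches what we see.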

On Thu, May 12, 2022 at 9:20 PM tao xiao <xiaotao...@gmail.com> wrote:

> Forgot to mention: the Flink version is 1.13.2 and we are using Kubernetes
> native mode.
>
> On Thu, May 12, 2022 at 9:18 PM tao xiao <xiaotao...@gmail.com> wrote:
>
>> Hi team,
>>
>> I hit a weird issue when a job tried to recover from a JM failure. The
>> last successful checkpoint before the JM crashed was 41205:
>>
>> ```
>>
>> {"log":"2022-05-10 14:55:40,663 INFO  
>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
>> checkpoint 41205 for job 00000000000000000000000000000000 (9453840 bytes in 
>> 1922 ms).\n","stream":"stdout","time":"2022-05-10T14:55:40.663286893Z"}
>>
>> ```
>>
>> However, the JM tried to recover the job from an older checkpoint, 41051,
>> which no longer exists, leaving the job in an unrecoverable state:
>>
>> ```
>>
>> "2022-05-10 14:59:38,949 INFO  
>> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - 
>> Trying to retrieve checkpoint 41051.\n"
>>
>> ```
>>
>> Full log attached.
>>
>> --
>> Regards,
>> Tao
>>
>
>
> --
> Regards,
> Tao
>


-- 
Regards,
Tao
