Hi team,

I ran into a strange issue when a job tries to recover from a JM failure. The
last successful checkpoint before the JM crashed was 41205:

```
{"log":"2022-05-10 14:55:40,663 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
Completed checkpoint 41205 for job 00000000000000000000000000000000
(9453840 bytes in 1922
ms).\n","stream":"stdout","time":"2022-05-10T14:55:40.663286893Z"}
```

However, the JM tries to recover the job from an older checkpoint, 41051, which
no longer exists, leaving the job in an unrecoverable state:

```
"2022-05-10 14:59:38,949 INFO
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore []
- Trying to retrieve checkpoint 41051.\n"
```

The full JM log is attached.

-- 
Regards,
Tao

Attachment: jm.log
