Hi Zhinan,

For the time to detect the failure, you can use the time at which the 'fullRestarts' metric increases. That gives you an indication of when the job failed.
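As a rough sketch of how one might capture that moment, the snippet below polls the job metrics over the REST API and reports the first sample at which 'fullRestarts' goes up. The endpoint path, the JSON shape (a list of id/value entries), the base URL, and the helper names are assumptions made for illustration, so please check them against the REST documentation of your Flink version. The resolution is also limited by the polling interval and by how often the metrics are refreshed on the server side.

```python
import json
import time
import urllib.request
from typing import Iterable, Iterator, Optional, Tuple


def parse_metric(body: str, metric_id: str = "fullRestarts") -> Optional[int]:
    """Extract one metric from a JSON list of {"id": ..., "value": ...} entries.

    The response shape is an assumption about the REST API, not a guarantee.
    """
    for entry in json.loads(body):
        if entry.get("id") == metric_id:
            return int(float(entry["value"]))
    return None


def first_increase(samples: Iterable[Tuple[float, int]]) -> Optional[Tuple[float, int]]:
    """Return (timestamp, value) of the first sample larger than its predecessor."""
    previous = None
    for timestamp, value in samples:
        if previous is not None and value > previous:
            return timestamp, value
        previous = value
    return None


def poll(base_url: str, job_id: str, interval_s: float = 1.0) -> Iterator[Tuple[float, int]]:
    """Yield (wall-clock time, fullRestarts) samples until the caller stops iterating.

    Hypothetical helper: the URL layout below is assumed, and network errors
    are deliberately left unhandled to keep the sketch short.
    """
    url = f"{base_url}/jobs/{job_id}/metrics?get=fullRestarts"
    while True:
        with urllib.request.urlopen(url) as response:
            value = parse_metric(response.read().decode("utf-8"))
        if value is not None:
            yield time.time(), value
        time.sleep(interval_s)
```

It could be used as `first_increase(poll("http://localhost:8081", "<job-id>"))`, which blocks until the counter increases and then returns the timestamp of that sample. Comparing that timestamp with the time at which you injected the failure gives an estimate of the detection time.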
The checkpoint recovery time actually consists of two parts:

1. The time for the JobManager (JM) to read the checkpoint metadata. There is currently no explicit metric for this, since it amounts to little more than reading a file of about 1 MB remotely from the DFS.
2. The time for the tasks to restore their state. This should be treated as the real checkpoint recovery time, and it can exceed 10 minutes when restoring from a savepoint. Unfortunately, this part is not recorded in metrics at present either.

If you find that a task is in the RUNNING state but is not consuming any records, it may be stuck restoring the checkpoint/savepoint.

Best
Yun Tang

________________________________
From: Zhinan Cheng <chingchi...@gmail.com>
Sent: Tuesday, August 18, 2020 11:50
To: user@flink.apache.org <user@flink.apache.org>
Subject: Flink checkpoint recovery time

Hi all,

I am working on measuring the failure recovery time of Flink, and I want to decompose the recovery time into different parts: the time to detect the failure, the time to restart the job, and the time to restore from the checkpoint.

I found that I can measure the downtime during a failure, the time to restart the job, and some checkpointing metrics, as shown below.

[measure.png]

Unfortunately, I cannot find any information about the failure detection time or the checkpoint recovery time. Does Flink provide any way to measure these? If not, how can I measure them?

Thanks a lot for your help.

Regards,