Hi Zhinan,

To find when the failure was detected, you can look at the time at which the 
'fullRestarts' metric increases. That tells you roughly when the job failed.
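
A minimal sketch of that idea in Python, assuming you collect the counter 
yourself (for example by polling the job metrics endpoint of the REST API, 
/jobs/<jobid>/metrics?get=fullRestarts). The helper name and the sample format 
are hypothetical and not part of Flink:

```python
from typing import Iterable, List, Tuple


def restart_times(samples: Iterable[Tuple[float, int]]) -> List[float]:
    """Return the timestamps at which the sampled counter increased.

    `samples` is a hypothetical sequence of (unix_time, fullRestarts)
    pairs, ordered by time, produced by whatever poller you use.
    """
    times = []
    previous = None
    for ts, value in samples:
        # An increase relative to the previous sample marks a restart.
        if previous is not None and value > previous:
            times.append(ts)
        previous = value
    return times


if __name__ == "__main__":
    samples = [(100.0, 0), (101.0, 0), (102.0, 1), (103.0, 1), (104.0, 2)]
    print(restart_times(samples))  # [102.0, 104.0]
```

The resolution is bounded by the polling interval, so subtracting the time at 
which you injected the failure from the first reported timestamp only gives an 
upper bound on the detection time.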

The checkpoint recovery time actually consists of two parts:

  1.  The time to read the checkpoint metadata in the JM. There is currently no 
explicit metric for this, since it amounts to little more than reading a file of 
roughly 1 MB remotely from the DFS.
  2.  The time for tasks to restore their state. This should be treated as the 
real checkpoint recovery time, and it can even exceed 10 minutes when restoring 
from a savepoint. Unfortunately, this part is not recorded in metrics either.
If you find that a task is in the RUNNING state but is not consuming any records, 
it may be stuck restoring the checkpoint/savepoint.
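
As a rough workaround, the "RUNNING but not consuming" window can be estimated 
from outside. The sketch below assumes you already know when the task entered 
RUNNING and have sampled a record counter such as the task's numRecordsIn. 
The function name and the sample format are hypothetical, and the result also 
includes any time the source was simply idle:

```python
from typing import Iterable, Optional, Tuple


def estimated_restore_seconds(
    running_since: float, samples: Iterable[Tuple[float, int]]
) -> Optional[float]:
    """Seconds from RUNNING until the record counter first increases.

    `samples` is a hypothetical, time-ordered sequence of
    (unix_time, records_in) pairs. Returns None if no increase is seen.
    """
    baseline = None
    for ts, count in samples:
        if ts < running_since:
            continue  # ignore samples taken before the task was RUNNING
        if baseline is None:
            baseline = count  # first observation after RUNNING
        elif count > baseline:
            return ts - running_since
    return None


if __name__ == "__main__":
    samples = [(9.0, 500), (10.0, 0), (11.0, 0), (12.5, 40)]
    print(estimated_restore_seconds(10.0, samples))  # 2.5
```

This is only an approximation: it is limited by the sampling interval, and it 
cannot distinguish state restore from an empty input.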

Best
Yun Tang
________________________________
From: Zhinan Cheng <chingchi...@gmail.com>
Sent: Tuesday, August 18, 2020 11:50
To: user@flink.apache.org <user@flink.apache.org>
Subject: Flink checkpoint recovery time

Hi all,

I am working on measuring the failure recovery time of Flink, and I want to 
decompose the recovery time into different parts: the time to detect the 
failure, the time to restart the job, and the time to restore from the 
checkpoint.

I found that I can measure the downtime during a failure, the time to restart 
the job, and some checkpointing metrics, as shown below.

[measure.png]
Unfortunately, I cannot find any information about the failure detection time or 
the checkpoint recovery time. Does Flink provide any way to measure these? If 
not, how can I measure them myself?

Thanks a lot for your help.

Regards,
