Hi all, I am working on measuring the failure recovery time of Flink, and I want to decompose it into separate parts: the time to detect the failure, the time to restart the job, and the time to restore state from the checkpoint.
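To make the decomposition concrete, here is a minimal sketch of what I have in mind. It assumes I could somehow obtain four timestamps per failure; the field names are hypothetical and are not Flink metrics or API names.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTimestamps:
    """Hypothetical timestamps (epoch milliseconds) for one failure; not Flink API."""
    failure_injected_ms: int   # when the failure actually happened
    failure_detected_ms: int   # when the failure was noticed
    job_restarted_ms: int      # when the job was running again
    state_restored_ms: int     # when state from the checkpoint was restored


def decompose(ts: RecoveryTimestamps) -> dict:
    """Split the total recovery time into the three parts described above."""
    detect = ts.failure_detected_ms - ts.failure_injected_ms
    restart = ts.job_restarted_ms - ts.failure_detected_ms
    restore = ts.state_restored_ms - ts.job_restarted_ms
    return {
        "detect_ms": detect,
        "restart_ms": restart,
        "restore_ms": restore,
        "total_ms": detect + restart + restore,
    }
```

The sketch treats the three phases as strictly sequential, which is itself an assumption; if restart and restore overlap in practice, the split would need to be defined differently.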
I found that I can measure the downtime during a failure, the time to restart the job, and some checkpointing metrics, as shown below. [image: measure.png] Unfortunately, I cannot find any information about the failure detection time or the checkpoint recovery time. Does Flink provide a way to measure these? If not, how could I measure them myself? Thanks a lot for your help. Regards,