Hi all,

I am working on measuring the failure recovery time of Flink, and I want to
decompose the recovery time into its components: the time to detect the
failure, the time to restart the job, and the time to restore state from the
checkpoint.
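Concretely, the decomposition is the arithmetic below. This is a minimal Python sketch that assumes four timestamps per failure are available. The class name, field names and the `decompose` helper are hypothetical, and obtaining the detection and restore timestamps is exactly the open question.

```python
from dataclasses import dataclass


@dataclass
class FailureTimeline:
    """Hypothetical epoch-millisecond timestamps for one failure/recovery cycle."""
    failure_ms: int    # when the fault actually happened (e.g. a TaskManager was killed)
    detected_ms: int   # when Flink noticed the failure (assumed source: logs or metrics)
    restarted_ms: int  # when the job was redeployed
    restored_ms: int   # when state from the checkpoint was fully restored


def decompose(t: FailureTimeline) -> dict:
    """Split total recovery time into detection, restart and restore phases (ms)."""
    if not (t.failure_ms <= t.detected_ms <= t.restarted_ms <= t.restored_ms):
        raise ValueError("timestamps must be non-decreasing")
    return {
        "detection": t.detected_ms - t.failure_ms,
        "restart": t.restarted_ms - t.detected_ms,
        "restore": t.restored_ms - t.restarted_ms,
        "total": t.restored_ms - t.failure_ms,
    }


# Example with made-up timestamps:
print(decompose(FailureTimeline(0, 500, 1500, 4000)))
# {'detection': 500, 'restart': 1000, 'restore': 2500, 'total': 4000}
```

The three phases sum to the total by construction, so any two measured phases plus the total would determine the third.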

I found that I can measure the downtime during a failure, the time to restart
the job, and some checkpointing metrics, as shown below.

[image: measure.png]
Unfortunately, I cannot find any information about the failure detection time
or the checkpoint recovery time. Does Flink provide a way to measure these? If
not, how could I go about measuring them?

Thanks a lot for your help.

Regards,
