Hi,

> I want to decompose the recovery time into different parts, say
> (1) the time to detect the failure,
> (2) the time to restart the job,
> (3) and the time to restore the checkpoint.
1. Maybe I'm missing something, but as far as I can tell, Flink cannot help you with that directly. The time to detect the failure would be the interval between when the failure actually occurred and when the JobManager learns about it. If we could reliably measure when the former happened, we could trigger failover immediately; you are interested in this precisely because there is no reliable way to detect the failure right away. You could approximate it by analysing the logs.

2. There may be some metrics you could use; if not, you can use the REST API [1] to monitor the job status. Again, you could also do this by analysing the logs.

3. In the future this might be measurable via the REST API (similarly to point 2.), but currently there is no way to do it like that. There is a ticket for that [2]. For now, I think the only way is to analyse the logs.

If you just need to do this once, I would analyse the logs manually. If you want to do it many times or monitor this continuously, I would write a simple script (Python?) that mixes REST API polling for 2. with log analysis; a rough sketch is appended at the end of this message.

Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs
[2] https://issues.apache.org/jira/browse/FLINK-17012

On Tue, 18 Aug 2020 at 04:07, Zhinan Cheng <znch...@cse.cuhk.edu.hk> wrote:

> Hi all,
>
> I am working on measuring the failure recovery time of Flink, and I
> want to decompose the recovery time into different parts, say the time
> to detect the failure, the time to restart the job, and the time to
> restore the checkpoint.
>
> Unfortunately, I cannot find any information in the Flink docs about
> this. Is there any way that Flink provides for this? Otherwise, how
> can I solve it?
>
> Thanks a lot for your help.
>
> Regards,
> Juno
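To make the polling idea for 2. concrete, here is a minimal, untested sketch. It assumes the default JobManager REST port 8081, that `GET /jobs/<job-id>` returns the job's current `state` (as in recent Flink versions), and `<your-job-id>` is a placeholder you would replace with a real job ID (e.g. from `GET /jobs`):

```python
# Rough sketch: poll the Flink REST API for a job's state and record the
# timestamps of state transitions (e.g. RUNNING -> RESTARTING -> RUNNING).
# Assumptions: JobManager REST endpoint reachable at BASE_URL; JOB_ID is
# the job to watch. Adjust both for your setup.
import json
import time
import urllib.request

BASE_URL = "http://localhost:8081"  # default JobManager REST port
JOB_ID = "<your-job-id>"            # placeholder, e.g. taken from GET /jobs
POLL_INTERVAL_S = 0.1

def job_state():
    # Job-details endpoint; its response includes a top-level "state" field.
    with urllib.request.urlopen(f"{BASE_URL}/jobs/{JOB_ID}") as resp:
        return json.load(resp)["state"]

last_state = None
while True:
    state = job_state()
    if state != last_state:
        # The timestamps of these transitions bound the restart phase:
        # RUNNING -> RESTARTING marks failover start as seen by the JobManager,
        # RESTARTING -> RUNNING marks the job being redeployed.
        print(f"{time.time():.3f} {last_state} -> {state}")
        last_state = state
    time.sleep(POLL_INTERVAL_S)
```

You could then combine these transition timestamps with timestamps pulled from the logs (e.g. when the failure was first logged on the TaskManager) to separate detection time from restart time. The checkpoint-restore duration from 3. would still have to come from the logs until something like [2] is implemented.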