Hi,

> I want to decompose the recovery time into different parts, say
> (1) the time to detect the failure,
> (2) the time to restart the job,
> (3) and the time to restore the checkpoint.
1. Maybe I'm missing something, but as far as I can tell, Flink cannot help you with that directly. The time to detect the failure would be the interval between when the failure actually occurred and when the JobManager learns about it. If we could reliably measure when the former happened, we could trigger failover immediately; you are interested in this precisely because there is no reliable way to detect the failure right away. You could approximate it by analysing the logs.

2. There may be some metrics you could use; if not, you can use the REST API [1] to monitor the job status. Again, you could also do this by analysing the logs.

3. In the future this might be measurable via the REST API (similarly to point 2.), but currently there is no way to do it like that. There is a ticket for that [2]. For now, I think the only way is to analyse the logs.

If you just need to do this once, I would analyse the logs manually. If you want to do it many times or monitor this continuously, I would write a simple script (Python?) that mixes REST API polling for 2. with log analysis; a rough sketch is appended at the end of this message.

Piotrek

[1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs
[2] https://issues.apache.org/jira/browse/FLINK-17012

On Tue, 18 Aug 2020 at 04:07, Zhinan Cheng <znch...@cse.cuhk.edu.hk> wrote:

> Hi all,
>
> I am working on measuring the failure recovery time of Flink, and I
> want to decompose the recovery time into different parts, say the time
> to detect the failure, the time to restart the job, and the time to
> restore the checkpoint.
>
> Unfortunately, I cannot find any information in the Flink docs about
> this. Is there any way that Flink provides for this? Otherwise, how
> can I solve it?
>
> Thanks a lot for your help.
>
> Regards,
> Juno
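To make the polling idea for 2. concrete, here is a minimal, untested sketch. It assumes the default JobManager REST port 8081, that `GET /jobs/<job-id>` returns the job's current `state` (as in recent Flink versions), and `<your-job-id>` is a placeholder you would replace with a real job ID (e.g. from `GET /jobs`):

```python
# Rough sketch: poll the Flink REST API for a job's state and record the
# timestamps of state transitions (e.g. RUNNING -> RESTARTING -> RUNNING).
# Assumptions: JobManager REST endpoint reachable at BASE_URL; JOB_ID is
# the job to watch. Adjust both for your setup.
import json
import time
import urllib.request

BASE_URL = "http://localhost:8081"  # default JobManager REST port
JOB_ID = "<your-job-id>"            # placeholder, e.g. taken from GET /jobs
POLL_INTERVAL_S = 0.1

def job_state():
    # Job-details endpoint; its response includes a top-level "state" field.
    with urllib.request.urlopen(f"{BASE_URL}/jobs/{JOB_ID}") as resp:
        return json.load(resp)["state"]

last_state = None
while True:
    state = job_state()
    if state != last_state:
        # The timestamps of these transitions bound the restart phase:
        # RUNNING -> RESTARTING marks failover start as seen by the JobManager,
        # RESTARTING -> RUNNING marks the job being redeployed.
        print(f"{time.time():.3f} {last_state} -> {state}")
        last_state = state
    time.sleep(POLL_INTERVAL_S)
```

You could then combine these transition timestamps with timestamps pulled from the logs (e.g. when the failure was first logged on the TaskManager) to separate detection time from restart time. The checkpoint-restore duration from 3. would still have to come from the logs until something like [2] is implemented.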