Re: Decompose failure recovery time

Zhinan Cheng Thu, 20 Aug 2020 09:05:24 -0700

Hi Piotr,

Thanks a lot for your help.
Yes, I now realize that I can only approximate the times for (1) and
(3), and that I can measure (2) by monitoring the uptime and downtime
metrics provided by Flink.
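
For reference, here is a rough sketch of how (2) could also be estimated by
polling the job status over the REST API, as Piotr suggested. It assumes that
GET /jobs/<job_id> returns JSON with a "state" field and that a healthy job
reports RUNNING. The function names, the base URL, and the polling period are
illustrative, and the result is only as precise as the polling period.

```python
import json
import time
import urllib.request


def fetch_job_state(base_url, job_id):
    """Return the "state" field of GET /jobs/<job_id> (assumed response shape)."""
    with urllib.request.urlopen(f"{base_url}/jobs/{job_id}") as resp:
        return json.load(resp)["state"]


def downtime_intervals(samples):
    """Turn time-ordered (timestamp_seconds, state) samples into a list of
    (start, end) pairs, one per stretch in which the job was not RUNNING."""
    intervals = []
    start = None
    for ts, state in samples:
        if state != "RUNNING" and start is None:
            start = ts  # first sample that shows the job is down
        elif state == "RUNNING" and start is not None:
            intervals.append((start, ts))  # first sample that shows it is back
            start = None
    return intervals


def poll(base_url, job_id, period=0.5, rounds=120):
    """Sample the job state every `period` seconds, `rounds` times."""
    samples = []
    for _ in range(rounds):
        samples.append((time.time(), fetch_job_state(base_url, job_id)))
        time.sleep(period)
    return samples
```

Calling downtime_intervals(poll("http://localhost:8081", "<job id>")) while a
failure is injected would give the observed restart window. The address is a
placeholder for wherever the JobManager REST endpoint is reachable.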


My problem now is that the time for (2) can be as long as 40s, and I
wonder why it takes so long to restart the job.
The log shows that switching all operator instances from CANCELING to
CANCELED alone takes around 30s. Do you have any ideas about
this?
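
To show which tasks are slow to cancel, the per-task time between the two
state switches can be pulled out of the JobManager log. Below is a minimal
sketch. It assumes log lines that start with a "yyyy-MM-dd HH:mm:ss,SSS"
timestamp and contain "<logger> - <task> switched from X to Y"; the regular
expression and the function name are illustrative and may need adjusting to
the actual log layout.

```python
import re
from datetime import datetime

# Assumed line shape:
# 2020-08-20 09:00:00,000 INFO  <logger> - <task> switched from RUNNING to CANCELING.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*? - (?P<task>.+?) "
    r"switched from (?P<old>\w+) to (?P<new>\w+)"
)


def cancel_durations(lines):
    """Map each task to the seconds it spent between entering CANCELING
    and entering CANCELED. Lines that do not match are ignored."""
    started = {}
    durations = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        task = m.group("task")
        if m.group("new") == "CANCELING":
            started[task] = ts
        elif m.group("new") == "CANCELED" and task in started:
            durations[task] = (ts - started.pop(task)).total_seconds()
    return durations
```

Sorting the result by value, for example with
sorted(cancel_durations(open(path)).items(), key=lambda kv: kv[1]),
puts the slowest tasks at the end (path being the log file).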

Many thanks.

Regards,
Zhinan

On Thu, 20 Aug 2020 at 21:26, Piotr Nowojski <pnowoj...@apache.org> wrote:
>
> Hi,
>
> > I want to decompose the recovery time into different parts, say
> > (1) the time to detect the failure,
> > (2) the time to restart the job,
> > (3) and the time to restore the checkpointing.
>
> 1. Maybe I'm missing something, but as far as I can tell, Flink cannot
> help you with that. The time to detect the failure would be the time
> between the moment the failure occurred and the moment the JobManager
> becomes aware of it.
> If we could reliably measure/check when the first one happened, then we
> could immediately trigger failover. You are interested in this exactly
> because there is no reliable way to detect the failure immediately. You
> could approximate this by analysing the logs.
>
> 2. Maybe there are some metrics that you could use; if not, you could use
> the REST API [1] to monitor the job status. Again, you could also do it
> by analysing the logs.
>
> 3. In the future this might be measurable using the REST API (similarly to
> point 2.), but currently there is no way to do it that way. There is a
> ticket for that [2]. I think currently the only way to do it is by
> analysing the logs.
>
> If you just need to do this once, I would analyse the logs manually. If you
> want to do it many times or monitor this continuously, I would write a
> simple script (Python?) that combines REST API checks for 2. with log
> analysis.
>
> Piotrek
>
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs
> [2] https://issues.apache.org/jira/browse/FLINK-17012
>
> On Tue, 18 Aug 2020 at 04:07, Zhinan Cheng <znch...@cse.cuhk.edu.hk> wrote:
>
> > Hi all,
> >
> > I am working on measuring the failure recovery time of Flink and I
> > want to decompose the recovery time into different parts, say the time
> > to detect the failure, the time to restart the job, and the time to
> > restore the checkpointing.
> >
> > Unfortunately, I cannot find any information in the Flink docs that
> > addresses this. Does Flink provide any way to do this? If not,
> > how can I solve it?
> >
> > Thanks a lot for your help.
> >
> > Regards,
> > Juno
> >
