Hi Piotr,

Thanks a lot. I will try your suggestion and see what happens; a rough sketch of the polling script I have in mind is at the bottom of this mail.
Regards,
Zhinan

On Fri, 21 Aug 2020 at 00:40, Piotr Nowojski <pnowoj...@apache.org> wrote:
>
> Hi Zhinan,
>
> It's hard to say, but my guess is that it takes that long for the tasks to respond to cancellation, which consists of a couple of steps. If a task is currently busy processing something, it has to respond to interruption (`java.lang.Thread#interrupt`). If it takes 30 seconds for a task to react to the interruption and clean up its resources, that can cause problems and there is very little that Flink can do.
>
> If you want to debug it further, I would suggest collecting stack traces during cancellation (or even better: profiling the code during cancellation). This would help you answer the question of what the task threads are busy with.
>
> Probably not a solution, but I'm mentioning it just in case: you can shorten the `task.cancellation.timeout` period. By default it's 180s. After that, the whole TaskManager will be killed. If you have spare TaskManagers or you can restart them very quickly, lowering this timeout might help to some extent (in exchange for a dirty shutdown, without cleaning up the resources).
>
> Piotrek
>
> On Thu, 20 Aug 2020 at 18:00, Zhinan Cheng <znch...@cse.cuhk.edu.hk> wrote:
>>
>> Hi Piotr,
>>
>> Thanks a lot for your help.
>> Yes, I finally realize that I can only approximate the time for [1] and [3] and measure [2] by monitoring the uptime and downtime metrics provided by Flink.
>>
>> And now my problem is that I found the time in [2] can be up to 40s, and I wonder why it takes so long to restart the job.
>> The log actually shows that the time to switch all operator instances from CANCELING to CANCELED is around 30s; do you have any ideas about this?
>>
>> Many thanks.
>>
>> Regards,
>> Zhinan
>>
>> On Thu, 20 Aug 2020 at 21:26, Piotr Nowojski <pnowoj...@apache.org> wrote:
>> >
>> > Hi,
>> >
>> > > I want to decompose the recovery time into different parts, say
>> > > (1) the time to detect the failure,
>> > > (2) the time to restart the job,
>> > > (3) and the time to restore the checkpointing.
>> >
>> > 1. Maybe I'm missing something, but as far as I can tell, Flink cannot help you with that. The time to detect the failure would be the time between when the failure occurred and when the JobManager learns about it. If we could reliably measure/check when the first one happened, we could immediately trigger failover. You are interested in this exactly because there is no reliable way to detect the failure immediately. You could approximate it by analysing the logs.
>> >
>> > 2. Maybe there are some metrics that you could use; if not, you can use the REST API [1] to monitor the job status. Again, you could also do it by analysing the logs.
>> >
>> > 3. In the future this might be measurable using the REST API (similar to point 2.), but currently there is no way to do it that way. There is a ticket for that [2]. I think currently the only way to do it is by analysing the logs.
>> >
>> > If you just need to do this once, I would analyse the logs manually. If you want to do it many times or monitor this continuously, I would write some simple script (python?) that mixes checking the REST API for 2. with log analysis.
>> >
>> > Piotrek
>> >
>> > [1] https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs
>> > [2] https://issues.apache.org/jira/browse/FLINK-17012
>> >
>> > On Tue, 18 Aug 2020 at 04:07, Zhinan Cheng <znch...@cse.cuhk.edu.hk> wrote:
>> >
>> > > Hi all,
>> > >
>> > > I am working on measuring the failure recovery time of Flink and I want to decompose the recovery time into different parts, say the time to detect the failure, the time to restart the job, and the time to restore the checkpointing.
>> > >
>> > > Unfortunately, I cannot find any information in the Flink docs to solve this. Is there any way that Flink provides for this? Otherwise, how can I solve this?
>> > >
>> > > Thanks a lot for your help.
>> > >
>> > > Regards,
>> > > Juno
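
P.S. Here is the rough sketch of the polling script I mentioned above, for part (2). It is only a sketch using the Python standard library: the REST endpoint address, the job-id placeholder, and the exact layout of the /jobs response are assumptions based on the REST API docs [1] and need to be checked against the actual cluster. The script timestamps every job-status transition, so the window between the job leaving and re-entering RUNNING approximates the restart time, and those timestamps can then be correlated with the logs for parts (1) and (3).

import json
import time
import urllib.request

REST_URL = "http://localhost:8081"  # JobManager REST endpoint (assumed default port)
JOB_ID = "<job-id>"                  # id of the job to watch (placeholder)
POLL_INTERVAL_S = 0.1

def job_status():
    # GET /jobs lists all jobs with their current status,
    # e.g. {"jobs": [{"id": "...", "status": "RUNNING"}]} (assumed layout).
    with urllib.request.urlopen(REST_URL + "/jobs") as resp:
        overview = json.load(resp)
    for job in overview.get("jobs", []):
        if job["id"] == JOB_ID:
            return job["status"]
    return None

last = None
while True:
    status = job_status()
    if status != last:
        # Print a wall-clock timestamp for every status transition; the gap
        # between leaving RUNNING and re-entering RUNNING approximates the
        # job-restart time (part 2).
        print("%.3f %s -> %s" % (time.time(), last, status))
        last = status
    time.sleep(POLL_INTERVAL_S)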