Re: Decompose failure recovery time

2020-08-20 Thread Zhinan Cheng
Hi Piotr, Thanks a lot. I will try your suggestion and see what happens. Regards, Zhinan On Fri, 21 Aug 2020 at 00:40, Piotr Nowojski wrote: > > Hi Zhinan, > > It's hard to say, but my guess is that it takes that long for the tasks to respond to > cancellation, which consists of a couple of steps. If a t

Re: Decompose failure recovery time

2020-08-20 Thread Piotr Nowojski
Hi Zhinan, It's hard to say, but my guess is that it takes that long for the tasks to respond to cancellation, which consists of a couple of steps. If a task is currently busy processing something, it has to respond to interruption (`java.lang.Thread#interrupt`). If it takes 30 seconds for a task to react
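To illustrate the interruption step mentioned above, here is a minimal sketch in plain Java (not Flink code) of a worker thread that reacts to `java.lang.Thread#interrupt`, together with a measurement of how long it takes to exit. The class name `InterruptibleTaskSketch`, the method `measureCancellationMillis`, and the `Thread.sleep` call standing in for record processing are all assumptions made for this example; a real task may be blocked in code that reacts to interruption much more slowly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical example class; not part of Flink.
class InterruptibleTaskSketch {

    /**
     * Starts a worker that loops until interrupted, interrupts it,
     * and returns the milliseconds between the interrupt and the worker's exit.
     */
    static long measureCancellationMillis(long workSliceMillis) throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            started.countDown();
            // Cooperative cancellation: check the interrupt flag on every iteration.
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    // Stand-in for processing one unit of work (assumption).
                    Thread.sleep(workSliceMillis);
                } catch (InterruptedException e) {
                    // sleep() clears the flag when it throws, so restore it
                    // to let the loop condition observe the interruption.
                    Thread.currentThread().interrupt();
                }
            }
        });

        worker.start();
        started.await();                 // wait until the worker is running

        long begin = System.nanoTime();
        worker.interrupt();              // request cancellation
        worker.join();                   // wait for the worker to finish
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - begin);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("cancellation took ms: " + measureCancellationMillis(50));
    }
}
```

Because `Thread.sleep` is interruptible, this worker exits almost immediately. A task that is busy in a long computation and never checks the interrupt flag would only stop once that computation returns, which is one way a long cancellation delay could arise.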

Re: Decompose failure recovery time

2020-08-20 Thread Zhinan Cheng
Hi Piotr, Thanks a lot for your help. Yes, I finally realized that I can only approximate the times for [1] and [3], and measure [2] by monitoring the uptime and downtime metrics provided by Flink. My problem now is that the time for [2] can be up to 40s. I wonder why it takes so long to r

Re: Decompose failure recovery time

2020-08-20 Thread Piotr Nowojski
Hi, > I want to decompose the recovery time into different parts, say > (1) the time to detect the failure, > (2) the time to restart the job, > (3) and the time to restore the checkpointing. 1. Maybe I'm missing something, but as far as I can tell, Flink cannot help you with that. Time to detec