Re: Flink checkpoint recovery time

2020-08-21 Thread Till Rohrmann
It should be akka.ask.timeout, which defines the RPC timeout. You can decrease it, but it might cause other RPCs to fail if you set it too low. Cheers, Till On Fri, Aug 21, 2020 at 9:45 AM Zhinan Cheng wrote: > Hi Till, > > Thanks for the reply. > Is the timeout 10s here always necessary
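For reference, the timeout Till mentions is a standard entry in flink-conf.yaml. A sketch of the setting (10 s is Flink's documented default; treat the trade-off comment as a summary of Till's warning, not official guidance):

```yaml
# flink-conf.yaml -- timeout for Flink's Akka-based RPC calls.
# Default is "10 s". Lowering it makes RPCs to dead hosts fail sooner,
# but setting it too low can make RPCs to healthy hosts time out under load.
akka.ask.timeout: 10 s
```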

Re: Flink checkpoint recovery time

2020-08-21 Thread Zhinan Cheng
Hi Till, Thanks for the reply. Is the timeout 10s here always necessary? Can I reduce this value to reduce the restart time of the job? I cannot find this term in the configuration of Flink currently. Regards, Zhinan On Fri, 21 Aug 2020 at 15:28, Till Rohrmann wrote: > You are right. The prob

Re: Flink checkpoint recovery time

2020-08-21 Thread Till Rohrmann
You are right. The problem is that Flink tries three times to cancel the call and every RPC call has a timeout of 10s. Since the machine on which the Task ran has died, it will take that long until the system decides to fail the Task instead [1]. [1] https://github.com/apache/flink/blob/master/fli
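Till's explanation lines up with the roughly 30 s Zhinan observed: three cancellation attempts, each bounded by the 10 s RPC timeout, must all expire before the Task is failed. A back-of-the-envelope sketch (the retry count and timeout come from this thread, not from a config lookup):

```python
# Worst-case delay before Flink gives up cancelling a task on a dead host:
# each cancel RPC must time out before the next attempt is made, so the
# total stall is roughly attempts * rpc_timeout.
def worst_case_cancel_seconds(rpc_timeout_s: float = 10.0, attempts: int = 3) -> float:
    """Estimate how long cancellation can hang when the target TaskManager is gone."""
    return rpc_timeout_s * attempts

print(worst_case_cancel_seconds())  # 3 attempts x 10 s each -> 30.0
```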

Re: Flink checkpoint recovery time

2020-08-20 Thread Zhinan Cheng
Hi Till, Thanks for the quick reply. Yes, the job actually restarts twice; the metric fullRestarts also indicates this, as its value is 2. But the job indeed takes around 30s to switch from CANCELLING to RESTARTING in its first restart. I just wonder why it takes so long here? Also, even if I set the

Re: Flink checkpoint recovery time

2020-08-20 Thread Till Rohrmann
Hi Zhinan, the logs show that the cancellation does not take 30s. What happens is that the job gets restarted a couple of times. The problem seems to be that one TaskManager died permanently but it takes the heartbeat timeout (default 50s) until it is detected as dead. In the meantime the system t
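The 50 s Till refers to is Flink's default heartbeat timeout. A sketch of the relevant flink-conf.yaml entries (values in milliseconds; these are the documented defaults, shown here on the assumption that the poster's setup is unmodified):

```yaml
# flink-conf.yaml -- heartbeats between the JobManager and TaskManagers.
# A TaskManager is declared dead only after heartbeat.timeout elapses,
# so a permanently lost machine takes up to 50 s to be detected by default.
heartbeat.interval: 10000   # how often heartbeats are sent (ms)
heartbeat.timeout: 50000    # how long to wait before marking a peer dead (ms)
```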

Re: Flink checkpoint recovery time

2020-08-19 Thread Till Rohrmann
Could you share the logs with us? This might help to explain why the cancellation takes so long. Flink is no longer using Akka's death watch mechanism. Cheers, Till On Wed, Aug 19, 2020 at 10:01 AM Zhinan Cheng wrote: > Hi Till, > > Thanks for the quick response. > > > for i) the cancellation d

Re: Flink checkpoint recovery time

2020-08-19 Thread Zhinan Cheng
Hi Till, Thanks for the quick response. > for i) the cancellation depends on the user code. If the user code does a blocking operation, Flink needs to wait until it returns from there before it can move the Task's state to CANCELED. for this, my code just includes a map operation and then aggrega

Re: Flink checkpoint recovery time

2020-08-19 Thread Till Rohrmann
Hi Zhinan, for i) the cancellation depends on the user code. If the user code does a blocking operation, Flink needs to wait until it returns from there before it can move the Task's state to CANCELED. for ii) I think your observation is correct. Could you please open a JIRA issue for this proble

Re: Flink checkpoint recovery time

2020-08-18 Thread Zhinan Cheng
Hi Yun, Thanks a lot for your help. It seems hard to measure the checkpoint restore time currently. I do monitor the "fullRestarts" metric and others like "uptime" and "downtime" to observe some information about failure recovery. I still have some questions: i) I found the time for the jobmanager to m

Re: Flink checkpoint recovery time

2020-08-18 Thread Yun Tang
Hi Zhinan, For the time to detect the failure, you could refer to the time when 'fullRestarts' increases. That could give you information about the time of the job failure. For the checkpoint recovery time, there are actually two parts: 1. The time to read the checkpoint meta in JM. However, this
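The metrics discussed in this thread (fullRestarts, uptime, downtime) are exposed per job through Flink's REST API under /jobs/&lt;job-id&gt;/metrics, which returns a list of id/value pairs. A minimal sketch that detects a restart from two successive polls; the JSON shape matches that endpoint, but the sample payloads below are made up for illustration:

```python
import json

def metric_value(payload: str, metric_id: str) -> float:
    """Pull one metric out of a Flink REST /metrics response (a list of id/value pairs)."""
    for entry in json.loads(payload):
        if entry["id"] == metric_id:
            return float(entry["value"])
    raise KeyError(metric_id)

# Two hypothetical polls of /jobs/<job-id>/metrics?get=fullRestarts,downtime
before = '[{"id": "fullRestarts", "value": "1"}, {"id": "downtime", "value": "0"}]'
after = '[{"id": "fullRestarts", "value": "2"}, {"id": "downtime", "value": "31000"}]'

# An increase in fullRestarts between polls marks a recovery; downtime is in ms.
restarts = metric_value(after, "fullRestarts") - metric_value(before, "fullRestarts")
print(f"restarts observed: {restarts:.0f}, downtime now {metric_value(after, 'downtime') / 1000:.0f}s")
```

As Yun notes, these metrics bound the failure-detection and restart window but do not isolate the checkpoint-restore time itself.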