[ 
https://issues.apache.org/jira/browse/FLINK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213937#comment-16213937
 ] 

Till Rohrmann commented on FLINK-7844:
--------------------------------------

[~zhenzhongxu], the akka ask timeout is not relevant for detecting dead 
TaskManagers. Instead, {{akka.watch.heartbeat.pause}} defines how quickly 
Akka's death watch detects that a {{TaskManager}} has died.

The checkpoint timeout defines how long a checkpoint may take after being 
triggered before it is cancelled. This value is independent of the heartbeat 
pause, and it should also work if the checkpoint timeout is smaller than the 
heartbeat pause. In case of a TM failure, this can mean that another checkpoint 
is triggered after the first one has timed out but before the TM has been 
detected as dead.
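To make the interaction concrete, here is a small sketch (plain Python, not Flink code) of the timeline described above. It assumes, as a simplification, that the TM dies at t=0 with a checkpoint in flight and that a new checkpoint is triggered immediately after each timeout; the numbers are illustrative, not recommendations.

```python
def timed_out_checkpoints(checkpoint_timeout_s: int, heartbeat_pause_s: int) -> int:
    """How many checkpoints time out before the death watch declares the TM
    dead, assuming the TM dies at t=0 with a checkpoint in flight and a new
    checkpoint is triggered immediately after each timeout (a simplification)."""
    return heartbeat_pause_s // checkpoint_timeout_s

# With the ticket's ~10 min checkpoint timeout and a (hypothetical) 50 s
# heartbeat pause, the dead TM is detected before the checkpoint times out:
print(timed_out_checkpoints(600, 50))   # 0
# With a checkpoint timeout smaller than the heartbeat pause, additional
# checkpoints are attempted before the TM is detected as dead:
print(timed_out_checkpoints(60, 120))   # 2
```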

In general, the checkpoint timeout should be set long enough that your 
application has a realistic chance of completing a triggered checkpoint. If a 
checkpoint only gets halfway through before it times out, you should increase 
the timeout.

The heartbeat pause should be configured low enough that you detect dead TMs 
quickly, but not so low that network delays or GC pauses cause many false 
positives.
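For reference, the death-watch settings mentioned above live in flink-conf.yaml. The values below are purely illustrative placeholders, not recommendations; tune them to your own network and GC behavior.

```yaml
# flink-conf.yaml -- illustrative values only
akka.watch.heartbeat.interval: 10 s   # how often heartbeats are sent
akka.watch.heartbeat.pause: 60 s      # max tolerated pause before a TM is declared dead
```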

> Fine Grained Recovery triggers checkpoint timeout failure
> ---------------------------------------------------------
>
>                 Key: FLINK-7844
>                 URL: https://issues.apache.org/jira/browse/FLINK-7844
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.4.0, 1.3.2
>            Reporter: Zhenzhong Xu
>            Assignee: Zhenzhong Xu
>         Attachments: screenshot-1.png
>
>
> Context: 
> We are using the "individual" (fine-grained) failover recovery strategy for 
> our embarrassingly parallel router use case. The topic has over 2,000 
> partitions, and parallelism is set to ~180, dispatched across more than 20 
> task managers providing around 180 slots.
> Observations:
> We've noticed that after one task manager terminated, individual recovery 
> happened correctly and the workload was re-dispatched to a newly available 
> task manager instance. However, the in-flight checkpoint took 10 minutes to 
> eventually time out, during which none of the other task managers could 
> commit checkpoints. In a worst-case scenario, if the job got restarted for 
> other reasons (e.g. a job manager termination), this would cause more 
> messages to be re-processed/duplicated compared to the same job without 
> fine-grained recovery enabled.
> I suspect that the overall (uber) checkpoint was waiting for a previous 
> checkpoint that was initiated by the old task manager and thus took a long 
> time to time out.
> Two questions:
> 1. Is there a configuration that controls this checkpoint timeout?
> 2. Is there any reason why, once the Job Manager realizes that the Task 
> Manager is gone and the workload has been redispatched, it still needs to 
> wait for the checkpoint initiated by the old task manager?
> Checkpoint screenshot in attachments.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)