Zhenzhong Xu created FLINK-7844:
-----------------------------------

             Summary: Fine Grained Recovery triggers checkpoint timeout failure
                 Key: FLINK-7844
                 URL: https://issues.apache.org/jira/browse/FLINK-7844
             Project: Flink
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.3.2
            Reporter: Zhenzhong Xu


Context: 
We are using the "individual" (fine-grained) failover recovery strategy for our 
embarrassingly parallel router use case. The topic has over 2000 partitions, 
and the job runs with a parallelism of ~180, dispatched to over 20 task 
managers with around 180 slots.

We've noticed that after one task manager terminates, the individual recovery 
happens correctly and the workload is re-dispatched to a newly available task 
manager instance. However, the checkpoint then takes 10 minutes to eventually 
time out, leaving all other task managers unable to commit checkpoints. In a 
worst-case scenario, if the job is restarted for other reasons (e.g. job 
manager termination), more messages would be re-processed/duplicated than in 
a job without fine-grained recovery enabled.

I suspect that the uber checkpoint was waiting for a previous checkpoint that 
was initiated by the old task manager, and thus took a long time to time out.

Two questions:
1. Is there a configuration that controls this checkpoint timeout?
2. Is there any reason why, once the Job Manager realizes that the Task Manager 
is gone and the workload has been redispatched, it still needs to wait for the 
checkpoint initiated by the old task manager?
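
To make the suspicion concrete, here is a toy sketch of the behavior I think we 
are seeing. It is not Flink code: the class, its methods, and the attempt IDs 
are all hypothetical, and it only assumes that a pending checkpoint waits for 
an acknowledgement from every task attempt that was running when it was 
triggered. If one of those attempts dies and its replacement runs under a new 
attempt ID, the pending checkpoint can never complete and is only cleared when 
the timeout fires.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only. All names here are hypothetical and are NOT Flink's classes.
final class PendingCheckpointSketch {
    // Task attempts that have not acknowledged this checkpoint yet.
    private final Set<String> awaitingAcks;
    private final long triggerTimeMs;
    private final long timeoutMs;

    PendingCheckpointSketch(Set<String> attemptIds, long triggerTimeMs, long timeoutMs) {
        this.awaitingAcks = new HashSet<>(attemptIds);
        this.triggerTimeMs = triggerTimeMs;
        this.timeoutMs = timeoutMs;
    }

    // An ack from an attempt that was never awaited (e.g. a replacement) changes nothing.
    void acknowledge(String attemptId) {
        awaitingAcks.remove(attemptId);
    }

    boolean isComplete() {
        return awaitingAcks.isEmpty();
    }

    // Without any other cleanup, an incomplete checkpoint lingers until the timeout.
    boolean isExpired(long nowMs) {
        return !isComplete() && nowMs - triggerTimeMs >= timeoutMs;
    }

    // What I would expect instead: a known-failed attempt that is still awaited
    // means the checkpoint can be dropped right away.
    boolean shouldAbortOnFailure(String failedAttemptId) {
        return awaitingAcks.contains(failedAttemptId);
    }
}
```

In this model, a checkpoint triggered for attempts "task-0#1" and "task-1#1" 
stays incomplete if "task-1#1" is lost and only its replacement "task-1#2" 
acknowledges, so it blocks until the full timeout elapses. A check like 
shouldAbortOnFailure would let it be discarded as soon as the failure is known.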




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
