Stephan Ewen created FLINK-1668:
-----------------------------------

             Summary: Add a config option to specify delays between restarts
                 Key: FLINK-1668
                 URL: https://issues.apache.org/jira/browse/FLINK-1668
             Project: Flink
          Issue Type: Improvement
    Affects Versions: 0.9
            Reporter: Stephan Ewen
            Assignee: Stephan Ewen
             Fix For: 0.9


The system currently introduces a short delay between a failed task execution 
and the restarted execution.

The reason is that this delay seemed to help in letting problems surface that 
let to the failed task. As an example, if a TaskManager fails, tasks fail due 
to data transfer errors. The TaskManager is not immediately recognized as 
failed, though (takes a bit until heartbeats time out). Immediately 
re-deploying tasks has a very high chance of assigning work to the TaskManager 
that is actually not responding, causing the execution retry to fail again. The 
delay gives the system time to figure out that the TaskManager was lost and 
does not take it into account upon the retry.

Currently, the system uses the heartbeat timeout as the default delay value. 
This may make sense as a default value for critical task failures, but is 
actually quite high for other types of failures.

In any case, I would like to add an option for users to specify the delay (even 
set it to 0, if desired).

The delay is not the best solution, in my opinion, we should eventually move to 
something better. Ideas are to put TaskManagers responsible for failed tasks in 
a "probationary" mode until they have reported back that everything is good 
(still alive, disk space available, etc)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to