[ https://issues.apache.org/jira/browse/FLINK-4911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhijiang Wang reassigned FLINK-4911:
------------------------------------

    Assignee: Zhijiang Wang

> Non-disruptive JobManager Failures via Reconciliation
> ------------------------------------------------------
>
>                 Key: FLINK-4911
>                 URL: https://issues.apache.org/jira/browse/FLINK-4911
>             Project: Flink
>          Issue Type: New Feature
>          Components: JobManager
>            Reporter: Stephan Ewen
>            Assignee: Zhijiang Wang
>
> JobManager failures can be handled in a non-disruptive way by *reconciling*
> the new JobManager leader and the TaskManagers.
> I suggest using this term (reconcile); it has also been used by other
> frameworks (like Mesos) for non-disruptive handling of failures.
> The basic approach is the following:
>   - When a JobManager fails, TaskManagers do not cancel their tasks, but
>     attempt to reconnect to the JobManager.
>   - On connect, each TaskManager tells the JobManager about its currently
>     running tasks.
>   - A new JobManager waits for TaskManagers to connect and report their task
>     status. It reconstructs the ExecutionGraph state from these reports.
>   - Tasks whose status was not reconstructed within a certain time are
>     assumed failed and trigger regular task recovery.
> To avoid having to re-implement this for the new JobManager / TaskManager
> approach in *flip-6*, I suggest implementing this directly in the
> {{flip-6}} feature branch.
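
The reconciliation steps in the issue description can be illustrated with a minimal Java sketch. It is not Flink code: the class ReconciliationSketch, the TaskState enum, and all method names are hypothetical and chosen only for illustration. Timestamps are passed in explicitly so that the timeout behaviour is easy to follow. The sketch covers only the new JobManager's side, assuming it already knows the set of task IDs belonging to the job.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of JobManager-side reconciliation; not a Flink API. */
final class ReconciliationSketch {

    enum TaskState { UNKNOWN, RUNNING, FAILED }

    // State of every task the job is expected to have, keyed by a task id.
    private final Map<String, TaskState> tasks = new HashMap<>();
    private final long deadlineMillis;

    ReconciliationSketch(Set<String> expectedTaskIds, long startMillis, long timeoutMillis) {
        // A new leader starts without knowing the state of any task.
        for (String id : expectedTaskIds) {
            tasks.put(id, TaskState.UNKNOWN);
        }
        this.deadlineMillis = startMillis + timeoutMillis;
    }

    /** Called when a TaskManager reconnects and reports its running tasks. */
    void onTaskManagerReport(Set<String> runningTaskIds, long nowMillis) {
        if (nowMillis > deadlineMillis) {
            return; // assumption: reports arriving after the deadline are ignored
        }
        for (String id : runningTaskIds) {
            // Only tasks that belong to the job and are still unknown are accepted.
            if (tasks.get(id) == TaskState.UNKNOWN) {
                tasks.put(id, TaskState.RUNNING);
            }
        }
    }

    /**
     * Once the deadline has passed, marks every unreported task as FAILED and
     * returns the ids that need regular task recovery. Before the deadline it
     * changes nothing and returns an empty set.
     */
    Set<String> finishReconciliation(long nowMillis) {
        Set<String> toRecover = new TreeSet<>();
        if (nowMillis < deadlineMillis) {
            return toRecover;
        }
        for (Map.Entry<String, TaskState> entry : tasks.entrySet()) {
            if (entry.getValue() == TaskState.UNKNOWN) {
                entry.setValue(TaskState.FAILED);
                toRecover.add(entry.getKey());
            }
        }
        return toRecover;
    }

    TaskState stateOf(String taskId) {
        return tasks.get(taskId);
    }
}
```

In this sketch, a job with three expected tasks where only two are reported before the deadline would end reconciliation with those two in RUNNING and the third returned for regular recovery. A real implementation would also have to handle TaskManager-side reconnection and reports for tasks the new leader does not recognise, which the issue leaves open.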