Till Rohrmann created FLINK-2289: ------------------------------------ Summary: Make JobManager highly available Key: FLINK-2289 URL: https://issues.apache.org/jira/browse/FLINK-2289 Project: Flink Issue Type: Improvement Reporter: Till Rohrmann Assignee: Till Rohrmann
Currently, the {{JobManager}} is the single point of failure in the Flink system. If it fails, then your job cannot be recovered and the Flink cluster is no longer able to receive new jobs. Therefore, it is crucial to make the {{JobManager}} fault tolerant so that the Flink cluster can recover from failed {{JobManager}}. As a first step towards this goal, I propose to make the {{JobManager}} highly available by starting multiple instances and using Apache ZooKeeper to elect a leader. The leader is responsible for the execution of the Flink job. In case that the {{JobManager}} dies, one of the other running {{JobManager}} will be elected as the leader and take over the role of the leader. The {{Client}} and the {{TaskManager}} will automatically detect the new {{JobManager}} by querying the ZooKeeper cluster. Note that this does not achieve full fault tolerance for the {{JobManager}} but it allows the cluster to recover from failed {{JobManager}}. The design of high-availability for the {{JobManager}} is tracked in the wiki here [1]. Resources: [1] [https://cwiki.apache.org/confluence/display/FLINK/JobManager+High+Availability] -- This message was sent by Atlassian JIRA (v6.3.4#6332)