Till Rohrmann created FLINK-11813:
-------------------------------------

             Summary: Standby per job mode Dispatchers don't know job's JobSchedulingStatus
                 Key: FLINK-11813
                 URL: https://issues.apache.org/jira/browse/FLINK-11813
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.7.2, 1.6.4, 1.8.0
            Reporter: Till Rohrmann

At the moment, standby {{Dispatchers}} in per-job mode can restart a terminated job after they gain leadership. The problem is that we currently clear the {{RunningJobsRegistry}} once a job has reached a globally terminal state. After the leading {{Dispatcher}} terminates, a standby {{Dispatcher}} gains leadership. Without the information from the {{RunningJobsRegistry}}, it cannot tell whether the job has already been executed or whether it needs to re-execute the job. At the moment, the {{Dispatcher}} assumes that there was a fault and hence re-executes the job. This can lead to duplicate results.

I think we need some way to tell standby {{Dispatchers}} that a certain job has been successfully executed. One trivial solution would be to not clean up the {{RunningJobsRegistry}}, but then we would clutter ZooKeeper.
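
To make the failure mode concrete, here is a minimal, self-contained sketch of the sequence described in this issue. It is a toy model, not Flink's actual code: {{InMemoryRunningJobsRegistry}}, {{shouldExecuteOnLeadership}}, {{simulate}} and the job id are hypothetical names made up for illustration, the enum values are assumed, and a plain map stands in for the ZooKeeper-backed registry.

```java
import java.util.HashMap;
import java.util.Map;

public class StandbyDispatcherSketch {

    /** Status values assumed for illustration; named after the issue title. */
    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    /** Hypothetical in-memory stand-in for the ZooKeeper-backed registry. */
    static class InMemoryRunningJobsRegistry {
        private final Map<String, JobSchedulingStatus> entries = new HashMap<>();

        void setJobRunning(String jobId) {
            entries.put(jobId, JobSchedulingStatus.RUNNING);
        }

        void setJobFinished(String jobId) {
            entries.put(jobId, JobSchedulingStatus.DONE);
        }

        /** Removing the entry makes the job indistinguishable from one never seen. */
        void clearJob(String jobId) {
            entries.remove(jobId);
        }

        JobSchedulingStatus getJobSchedulingStatus(String jobId) {
            return entries.getOrDefault(jobId, JobSchedulingStatus.PENDING);
        }
    }

    /** Decision a dispatcher makes on gaining leadership: true means (re-)execute. */
    static boolean shouldExecuteOnLeadership(InMemoryRunningJobsRegistry registry, String jobId) {
        return registry.getJobSchedulingStatus(jobId) != JobSchedulingStatus.DONE;
    }

    /**
     * Plays through: leader runs the job, job finishes, registry is optionally
     * cleared, leader terminates, standby gains leadership and decides.
     */
    static boolean simulate(boolean clearRegistryOnTermination) {
        InMemoryRunningJobsRegistry registry = new InMemoryRunningJobsRegistry();
        String jobId = "job-1"; // hypothetical job id

        registry.setJobRunning(jobId);
        registry.setJobFinished(jobId);
        if (clearRegistryOnTermination) {
            registry.clearJob(jobId); // current behavior described in this issue
        }
        return shouldExecuteOnLeadership(registry, jobId);
    }

    public static void main(String[] args) {
        System.out.println("cleared registry  -> re-execute: " + simulate(true));
        System.out.println("retained registry -> re-execute: " + simulate(false));
    }
}
```

In this model the standby re-executes the job only when the entry was cleared, which corresponds to the duplicate-results case. Retaining the entry avoids that, at the cost of one leftover entry per finished job, which is the ZooKeeper clutter mentioned above.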