[ 
https://issues.apache.org/jira/browse/FLINK-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818021#comment-16818021
 ] 

Till Rohrmann commented on FLINK-11813:
---------------------------------------

At the moment I would even question whether we need the {{RUNNING}} state. 
Without JM reconciliation it simply is not needed. And even if we had such a 
feature, one could have a reconciliation period during which one does not 
recover jobs. In the HA case, one would not even have to wait for this, 
because the leader election makes sure that there is only one JM executing 
the job.

Concerning point 4., it is important that the JM does not terminate (give up 
leadership) until the {{Dispatcher}} has written the {{DONE}} state. 
Alternatively, one could allow the JM to write the {{DONE}} state. Otherwise 
another JM could gain leadership and see that the state is not yet {{DONE}}.
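
To illustrate why the ordering matters, here is a minimal sketch in plain Java. 
It is not Flink code: the class, the {{Status}} enum and the single shared 
status field are hypothetical stand-ins for the status entry that a newly 
elected leader would read. It only contrasts the two orderings, writing 
{{DONE}} before versus after leadership is given up.

```java
import java.util.concurrent.atomic.AtomicReference;

final class DoneBeforeRevokeSketch {
    enum Status { PENDING, DONE }

    // Hypothetical stand-in for the shared job status entry (assumption, not Flink's API).
    static final AtomicReference<Status> status = new AtomicReference<>(Status.PENDING);

    // What a newly elected leader would conclude from the shared status.
    static boolean newLeaderWouldRerun() {
        return status.get() != Status.DONE;
    }

    // Safe ordering: DONE is written first, leadership is given up afterwards.
    static boolean finishThenRevoke() {
        status.set(Status.PENDING);
        status.set(Status.DONE);            // status is persisted while still leader
        return newLeaderWouldRerun();       // the new leader looks only after that
    }

    // Unsafe ordering: leadership is given up before DONE is written.
    static boolean revokeThenFinish() {
        status.set(Status.PENDING);
        boolean rerun = newLeaderWouldRerun(); // the new leader looks inside the gap
        status.set(Status.DONE);               // written too late to be seen
        return rerun;
    }

    public static void main(String[] args) {
        System.out.println("finish then revoke, rerun? " + finishThenRevoke());
        System.out.println("revoke then finish, rerun? " + revokeThenFinish());
    }
}
```

In the second ordering the new leader observes a state that is not yet 
{{DONE}}, which is exactly the window described above.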

Concerning the unification of the job and session cluster, I think we should 
not do this. We deliberately implemented the job cluster so that it does not 
need an additional job submission step, because operations are much easier 
without additional client-cluster communication. Before, we always had the 
problem that the client needed to poll the cluster status to know when to 
submit the job. This was quite brittle.

> Standby per job mode Dispatchers don't know job's JobSchedulingStatus
> ---------------------------------------------------------------------
>
>                 Key: FLINK-11813
>                 URL: https://issues.apache.org/jira/browse/FLINK-11813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> At the moment, it can happen that standby {{Dispatchers}} in per job mode 
> will restart a terminated job after they gained leadership. The problem is 
> that we currently clear the {{RunningJobsRegistry}} once a job has reached a 
> globally terminal state. After the leading {{Dispatcher}} terminates, a 
> standby {{Dispatcher}} will gain leadership. Without having the information 
> from the {{RunningJobsRegistry}} it cannot tell whether the job has been 
> executed or whether the {{Dispatcher}} needs to re-execute the job. At the 
> moment, the {{Dispatcher}} will assume that there was a fault and hence 
> re-execute the job. This can lead to duplicate results.
> I think we need some way to tell standby {{Dispatchers}} that a certain job 
> has been successfully executed. One trivial solution could be to not clean up 
> the {{RunningJobsRegistry}} but then we will clutter ZooKeeper.
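
The problem in the description above can be sketched in a few lines of plain 
Java. This is a simplified in-memory model under stated assumptions, not the 
actual {{RunningJobsRegistry}} implementation: the class and method names are 
hypothetical, and a missing entry is assumed to read as {{PENDING}}.

```java
import java.util.HashMap;
import java.util.Map;

final class ClearedRegistrySketch {
    enum Status { PENDING, RUNNING, DONE }

    // Hypothetical in-memory stand-in for the registry contents.
    private final Map<String, Status> entries = new HashMap<>();

    void set(String jobId, Status s) { entries.put(jobId, s); }

    void clear(String jobId) { entries.remove(jobId); }

    // Assumption: a missing entry cannot be told apart from a job that never ran.
    Status get(String jobId) { return entries.getOrDefault(jobId, Status.PENDING); }

    // Decision a standby takes after gaining leadership.
    static boolean standbyReexecutes(ClearedRegistrySketch registry, String jobId) {
        return registry.get(jobId) != Status.DONE;
    }

    public static void main(String[] args) {
        ClearedRegistrySketch cleared = new ClearedRegistrySketch();
        cleared.set("job-1", Status.DONE);
        cleared.clear("job-1"); // cleanup after the globally terminal state
        System.out.println("cleared, re-execute? " + standbyReexecutes(cleared, "job-1"));

        ClearedRegistrySketch kept = new ClearedRegistrySketch();
        kept.set("job-1", Status.DONE); // entry retained, at the cost of clutter
        System.out.println("kept, re-execute? " + standbyReexecutes(kept, "job-1"));
    }
}
```

With the entry cleared, the standby has no way to see that the job already 
finished and decides to run it again; with the entry kept, it does not.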



