[jira] [Commented] (FLINK-11813) Standby per job mode Dispatchers don't know job's JobSchedulingStatus

TisonKun (JIRA) Wed, 10 Apr 2019 04:24:11 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16814355#comment-16814355
 ]


TisonKun commented on FLINK-11813:
----------------------------------

Also, as discussed in 
[PR#7889|https://github.com/apache/flink/pull/7889#pullrequestreview-210604709],
 if we want to use a different Dispatcher instance for each leader session, then 
when the former leader terminates it may find no other contenders, yet it could 
itself be re-granted leadership later. In that case we had better not clean up 
any data in ZooKeeper.

Further, there is a possible race condition: even if a dispatcher detects no 
other leader contenders, a contender could be launched right after the detection.
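
To make the interleaving concrete, here is a minimal single-threaded sketch of 
that check-then-act race. Everything in it ({{ContenderRaceSketch}}, the 
contender set, the {{haDataPresent}} flag) is a hypothetical stand-in I made up 
for illustration, not Flink's leader election or ZooKeeper code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the HA backend; NOT Flink's actual classes.
final class ContenderRaceSketch {
    private final Set<String> contenders = new HashSet<>();
    private boolean haDataPresent = true;

    void register(String dispatcherId) { contenders.add(dispatcherId); }
    void unregister(String dispatcherId) { contenders.remove(dispatcherId); }

    // "Check": the terminating dispatcher looks for other contenders.
    boolean noContenders() { return contenders.isEmpty(); }

    // "Act": the terminating dispatcher wipes the shared HA data.
    void cleanUp() { haDataPresent = false; }

    boolean haDataPresent() { return haDataPresent; }

    public static void main(String[] args) {
        ContenderRaceSketch ha = new ContenderRaceSketch();
        ha.register("dispatcher-A");

        // dispatcher-A terminates and withdraws from the election.
        ha.unregister("dispatcher-A");
        boolean alone = ha.noContenders();   // check: nobody else is there

        // A new contender is launched after the check but before the act.
        ha.register("dispatcher-B");

        if (alone) {
            ha.cleanUp();                    // act on a stale observation
        }

        System.out.println("HA data still present for dispatcher-B: " + ha.haDataPresent());
    }
}
```

The point is only that the decision to clean up is based on an observation 
that can already be outdated when the clean-up runs.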

---

As for your two questions, the problem is how we identify "the same job". If we 
say that any jobs with the same job id are identical, then we always need a DONE 
entry. Alternatively, we might identify "the same job" as jobs that have the 
same job id and overlapping runtime lifecycles; that is, two job instances with 
the same job id but submitted a long time apart are considered to be two 
different jobs. In this case we can drop the use of DONE. Besides, either of the 
strategies mentioned above only identifies "the same job" within the same 
cluster, because we clean up data in the high-availability backend on cluster 
termination.
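
The two definitions can be contrasted with a small sketch. It is an in-memory 
toy under my own assumptions: {{JobIdentitySketch}}, {{keepDoneEntries}} and 
{{shouldExecute}} are hypothetical names, and only the PENDING / RUNNING / DONE 
states are borrowed from {{JobSchedulingStatus}}.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory registry; NOT Flink's RunningJobsRegistry.
final class JobIdentitySketch {

    enum Status { PENDING, RUNNING, DONE }

    private final Map<String, Status> entries = new HashMap<>();

    // true:  "same job id" always means "same job", so DONE entries are kept.
    // false: only overlapping lifecycles count, so the entry is dropped on finish.
    private final boolean keepDoneEntries;

    JobIdentitySketch(boolean keepDoneEntries) {
        this.keepDoneEntries = keepDoneEntries;
    }

    // A job id with no entry is treated as PENDING (never seen, or forgotten).
    Status status(String jobId) {
        return entries.getOrDefault(jobId, Status.PENDING);
    }

    void markRunning(String jobId) {
        entries.put(jobId, Status.RUNNING);
    }

    void markFinished(String jobId) {
        if (keepDoneEntries) {
            entries.put(jobId, Status.DONE);
        } else {
            entries.remove(jobId);
        }
    }

    // Assumed rule for this sketch: execute only jobs the registry reports as PENDING.
    boolean shouldExecute(String jobId) {
        return status(jobId) == Status.PENDING;
    }

    public static void main(String[] args) {
        for (boolean keepDone : new boolean[] {true, false}) {
            JobIdentitySketch registry = new JobIdentitySketch(keepDone);
            registry.markRunning("job-1");
            boolean whileRunning = registry.shouldExecute("job-1");
            registry.markFinished("job-1");
            boolean afterFinish = registry.shouldExecute("job-1");
            System.out.println("keepDoneEntries=" + keepDone
                    + " executeWhileRunning=" + whileRunning
                    + " executeAfterFinish=" + afterFinish);
        }
    }
}
```

With DONE entries kept, a later submission with the same job id is rejected; 
without them, it is accepted as a new job once the earlier instance has finished.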

> Standby per job mode Dispatchers don't know job's JobSchedulingStatus
> ---------------------------------------------------------------------
>
>                 Key: FLINK-11813
>                 URL: https://issues.apache.org/jira/browse/FLINK-11813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> At the moment, it can happen that standby {{Dispatchers}} in per job mode 
> will restart a terminated job after they gained leadership. The problem is 
> that we currently clear the {{RunningJobsRegistry}} once a job has reached a 
> globally terminal state. After the leading {{Dispatcher}} terminates, a 
> standby {{Dispatcher}} will gain leadership. Without having the information 
> from the {{RunningJobsRegistry}} it cannot tell whether the job has been 
> executed or whether the {{Dispatcher}} needs to re-execute the job. At the 
> moment, the {{Dispatcher}} will assume that there was a fault and hence 
> re-execute the job. This can lead to duplicate results.
> I think we need some way to tell standby {{Dispatchers}} that a certain job 
> has been successfully executed. One trivial solution could be to not clean up 
> the {{RunningJobsRegistry}} but then we will clutter ZooKeeper.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
