[ 
https://issues.apache.org/jira/browse/FLINK-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818575#comment-16818575
 ] 

Zhu Zhu edited comment on FLINK-11813 at 4/16/19 3:15 AM:
----------------------------------------------------------

I think that with SubmittedJobGraphStore being an underlying layer of 
RunningJobsRegistry, there is no need to update the job status to RUNNING 
explicitly. We may wrap them in a *JobStore* (or simply extend the 
SubmittedJobGraphStore interface) that not only provides submitted JobGraphs 
but also supports JobSchedulingStatus queries.

The store needs only 2 operations:
 # _*addJob(submittedJobGraph)*_ to add a newly submitted JobGraph
 # _*markDone(jobID)*_ to mark the job status as DONE, which should also be 
stored in the SubmittedJobGraphStore (we can even drop the graph file and keep 
only the DONE status). (By the way, the word *DONE* seems to mean that the job 
is FINISHED, not CANCELLED or FAILED; should we use a more accurate word like 
*TERMINATED*?)
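
To make the two operations concrete, here is a minimal in-memory sketch. It is not Flink code: the class name SketchJobStore, the String job IDs, the byte[] stand-in for a serialized JobGraph, and the isDone/getGraph helpers are all assumptions made for illustration. It also takes the "drop the graph and keep only the DONE marker" option mentioned above.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names are Flink API.
final class SketchJobStore {

    // Per-job entry: the serialized graph (null once dropped) plus a terminal flag.
    private static final class Entry {
        final byte[] graph;
        final boolean done;

        Entry(byte[] graph, boolean done) {
            this.graph = graph;
            this.done = done;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    /** Adds a newly submitted job; returns false (and changes nothing) if the job is already known. */
    boolean addJob(String jobId, byte[] serializedGraph) {
        return entries.putIfAbsent(jobId, new Entry(serializedGraph, false)) == null;
    }

    /** Marks a known job as done and drops its graph, keeping only the marker. Returns false for an unknown job. */
    boolean markDone(String jobId) {
        return entries.computeIfPresent(jobId, (id, e) -> new Entry(null, true)) != null;
    }

    /** True only if the job is known and has been marked done. */
    boolean isDone(String jobId) {
        Entry e = entries.get(jobId);
        return e != null && e.done;
    }

    /** The stored graph, or empty if the job is unknown or its graph was dropped. */
    Optional<byte[]> getGraph(String jobId) {
        Entry e = entries.get(jobId);
        return e == null ? Optional.empty() : Optional.ofNullable(e.graph);
    }
}
```

A real store would persist the entry in the HA backend rather than in a local map; the map here only illustrates the contract of the two operations.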

 

The underlying status will be:
 # NONE: job graph does not exist
 # RUNNING: job graph exists and is not DONE
 # DONE: job graph exists and is DONE

So the JobSchedulingStatus would transition as below:

*NONE* -- _addJob_ --> *RUNNING* -- _markDone_ --> *DONE*
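
The status derivation and the transition above can be written down as a small state machine. This is only a sketch: SketchStatus and its method names are hypothetical, and rejecting markDone for an unknown job is an assumption that the comment does not spell out.

```java
// Hypothetical sketch of the proposed three-state model; not the Flink enum.
enum SketchStatus {
    NONE, RUNNING, DONE;

    /** Derives the status from what the store holds for a job. */
    static SketchStatus of(boolean graphExists, boolean markedDone) {
        if (!graphExists) {
            return NONE;
        }
        return markedDone ? DONE : RUNNING;
    }

    /** NONE -- addJob --> RUNNING; adding an already known job leaves the status unchanged. */
    SketchStatus afterAddJob() {
        return this == NONE ? RUNNING : this;
    }

    /** RUNNING -- markDone --> DONE; repeating markDone on a DONE job is a no-op. */
    SketchStatus afterMarkDone() {
        if (this == NONE) {
            // Assumption: marking an unknown job as done is treated as an error.
            throw new IllegalStateException("cannot mark an unknown job as done");
        }
        return DONE;
    }
}
```

Because RUNNING is derived from "graph exists and not DONE", no code path ever has to write RUNNING explicitly.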

 

For job mode, we may need to change the current SingleJobSubmittedJobGraphStore 
to an HA SubmittedJobGraphStore, which would then make sharing the running 
status possible. The job mode dispatcher (MiniDispatcher) should add the 
embedded jobGraph to the JobStore once it is granted leadership (a duplicate 
jobGraph will be ignored).
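
A rough sketch of that leadership hand-over, under stated assumptions: SketchPerJobDispatcher, its Action enum and its callback names are hypothetical and are not the MiniDispatcher API, and a shared Map stands in for the HA-backed store (the value is true once the job is DONE).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; a shared map stands in for an HA-backed JobStore.
final class SketchPerJobDispatcher {

    enum Action { RUN_JOB, SKIP_ALREADY_DONE }

    private final Map<String, Boolean> sharedStore; // jobId -> done flag
    private final String embeddedJobId;

    SketchPerJobDispatcher(Map<String, Boolean> sharedStore, String embeddedJobId) {
        this.sharedStore = sharedStore;
        this.embeddedJobId = embeddedJobId;
    }

    /** Called when this dispatcher is granted leadership. */
    Action onGrantLeadership() {
        // Add the embedded job; if an entry already exists, the add is ignored
        // and the existing entry (possibly DONE) is returned instead.
        Boolean existing = sharedStore.putIfAbsent(embeddedJobId, Boolean.FALSE);
        boolean alreadyDone = existing != null && existing;
        return alreadyDone ? Action.SKIP_ALREADY_DONE : Action.RUN_JOB;
    }

    /** Called when the job reaches a globally terminal state. */
    void onJobTerminated() {
        sharedStore.put(embeddedJobId, Boolean.TRUE);
    }
}

final class SketchFailoverDemo {
    public static void main(String[] args) {
        Map<String, Boolean> store = new ConcurrentHashMap<>();
        SketchPerJobDispatcher leader = new SketchPerJobDispatcher(store, "job-1");
        SketchPerJobDispatcher standby = new SketchPerJobDispatcher(store, "job-1");

        System.out.println(leader.onGrantLeadership());  // first leader runs the job
        leader.onJobTerminated();
        System.out.println(standby.onGrantLeadership()); // standby sees DONE and skips
    }
}
```

If the first leader dies before the job terminates, the standby finds the entry still not DONE and re-executes the job, which is the recovery behaviour one would want in that case.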

 

 

 


> Standby per job mode Dispatchers don't know job's JobSchedulingStatus
> ---------------------------------------------------------------------
>
>                 Key: FLINK-11813
>                 URL: https://issues.apache.org/jira/browse/FLINK-11813
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> At the moment, it can happen that standby {{Dispatchers}} in per job mode 
> will restart a terminated job after they gained leadership. The problem is 
> that we currently clear the {{RunningJobsRegistry}} once a job has reached a 
> globally terminal state. After the leading {{Dispatcher}} terminates, a 
> standby {{Dispatcher}} will gain leadership. Without having the information 
> from the {{RunningJobsRegistry}} it cannot tell whether the job has been 
> executed or whether the {{Dispatcher}} needs to re-execute the job. At the 
> moment, the {{Dispatcher}} will assume that there was a fault and hence 
> re-execute the job. This can lead to duplicate results.
> I think we need some way to tell standby {{Dispatchers}} that a certain job 
> has been successfully executed. One trivial solution could be to not clean up 
> the {{RunningJobsRegistry}} but then we will clutter ZooKeeper.



