[ 
https://issues.apache.org/jira/browse/FLINK-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826357#comment-15826357
 ] 

Stephan Ewen commented on FLINK-5501:
-------------------------------------

I think the approach you outlined is good.

For future reference, [~till.rohrmann] and I also thought through the 
following alternatives, which we rejected in the end:

  1. Extend the leader election service so that it carries a number that 
increments whenever the leader changes. A leader elected with {{0}} simply 
starts the job; a leader elected with anything {{!= 0}} starts by 
reconciling. That approach, however, is not well suited to cluster sessions 
and does not separate concerns cleanly.

  2. The JobManager always starts the job, and if a TaskManager registers as 
"reconciling", it cancels the job and switches to "reconciling".
    - Advantage: No special state, plus eager acquisition of resources in case 
no reconciliation happens.
    - Disadvantage: Reconciliation is the more common case (assuming very 
long-running streaming jobs), so this heads "in the wrong direction" for the 
common case and triggers unnecessary resource allocation. It is also probably 
more complicated to implement.
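
To make alternative 1 concrete, here is a minimal sketch of the decision it 
implies. All names ({{StartupAction}}, {{LeaderEpochDecision}}, 
{{leaderEpoch}}) are hypothetical and do not exist in Flink; the sketch only 
assumes that the election service would hand the new leader a non-negative 
counter.

```java
// Hypothetical sketch of rejected alternative 1. None of these types exist in Flink.
enum StartupAction { START_JOB, RECONCILE }

final class LeaderEpochDecision {
    private LeaderEpochDecision() {}

    /**
     * Assumes the election service passes a counter that is 0 for the first
     * leader of a job and is incremented on every later leader change.
     */
    static StartupAction onLeadershipGranted(long leaderEpoch) {
        if (leaderEpoch < 0) {
            throw new IllegalArgumentException("leaderEpoch must be >= 0");
        }
        // Epoch 0: nobody has led this job before, so start it from scratch.
        // Any later epoch: a previous leader may have left running tasks behind.
        return leaderEpoch == 0 ? StartupAction.START_JOB : StartupAction.RECONCILE;
    }
}
```

The sketch also hints at the separation-of-concerns problem: the election 
service would have to keep per-job history purely so that the JobManager can 
pick a startup mode.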


> Determine whether the job starts from last JobManager failure
> -------------------------------------------------------------
>
>                 Key: FLINK-5501
>                 URL: https://issues.apache.org/jira/browse/FLINK-5501
>             Project: Flink
>          Issue Type: Sub-task
>          Components: JobManager
>            Reporter: Zhijiang Wang
>            Assignee: Zhijiang Wang
>
> When the {{JobManagerRunner}} is granted leadership, it should check whether 
> the current job is already running. If the job is running, the 
> {{JobManager}} should reconcile itself (enter the {{RECONCILING}} state) and 
> wait for the {{TaskManager}} to report task status. Otherwise the 
> {{JobManager}} can schedule the {{ExecutionGraph}} in the usual way.
> The {{RunningJobsRegistry}} can provide a way to check whether the job is 
> running, but we should extend the current interface and adjust the related 
> process to support this function.
> 1. {{RunningJobsRegistry}} sets the RUNNING status when the 
> {{JobManagerRunner}} is granted leadership for the first time.
> 2. If the job finishes, the job status is set to FINISHED in the 
> {{RunningJobsRegistry}}, and the status is deleted before exit. 
> 3. If the mini cluster starts multiple {{JobManagerRunner}} instances and the 
> leader {{JobManagerRunner}} has already finished the job and set the job 
> status to FINISHED, the other {{JobManagerRunner}} instances exit once they 
> are granted leadership.
> 4. If the {{JobManager}} fails, the job status remains RUNNING. So if a 
> {{JobManagerRunner}} (the previous one or a new one) is granted leadership 
> again, it checks the job status and enters the {{RECONCILING}} state.
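
The four steps in the description above can be sketched as a small state 
check. This is only an illustration under stated assumptions, not the actual 
Flink code: {{InMemoryJobRegistry}}, {{RegistryStatus}}, {{GrantAction}} and 
{{LeadershipHandler}} are hypothetical names, the real {{RunningJobsRegistry}} 
interface may look different, and {{ABSENT}} is an assumed stand-in for "no 
status recorded yet".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical types for illustration only; not Flink's RunningJobsRegistry API.
enum RegistryStatus { ABSENT, RUNNING, FINISHED }

enum GrantAction { SCHEDULE, RECONCILE, EXIT }

/** In-memory stand-in for a registry that would normally live in HA storage. */
final class InMemoryJobRegistry {
    private final Map<String, RegistryStatus> statuses = new ConcurrentHashMap<>();

    RegistryStatus getStatus(String jobId) {
        return statuses.getOrDefault(jobId, RegistryStatus.ABSENT);
    }

    void setRunning(String jobId) { statuses.put(jobId, RegistryStatus.RUNNING); }

    void setFinished(String jobId) { statuses.put(jobId, RegistryStatus.FINISHED); }

    /** Step 2: the status entry is deleted before exit. */
    void clear(String jobId) { statuses.remove(jobId); }
}

final class LeadershipHandler {
    private final InMemoryJobRegistry registry;

    LeadershipHandler(InMemoryJobRegistry registry) {
        this.registry = registry;
    }

    /** Decides what a runner does when it is granted leadership for jobId. */
    GrantAction onLeadershipGranted(String jobId) {
        switch (registry.getStatus(jobId)) {
            case ABSENT:
                // Step 1: first grant, so mark the job RUNNING and schedule it.
                // Check-then-set is safe here only because we assume a single leader at a time.
                registry.setRunning(jobId);
                return GrantAction.SCHEDULE;
            case RUNNING:
                // Step 4: a previous leader failed while the job was running.
                return GrantAction.RECONCILE;
            case FINISHED:
                // Step 3: another runner already finished the job.
                return GrantAction.EXIT;
            default:
                throw new IllegalStateException("unknown registry status");
        }
    }
}
```

With this sketch, a first grant for a job yields {{SCHEDULE}}, a second grant 
without an intervening finish yields {{RECONCILE}}, and a grant after 
{{setFinished}} yields {{EXIT}}.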



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
