[jira] [Commented] (FLINK-5501) Determine whether the job starts from last JobManager failure

Zhijiang Wang (JIRA) Tue, 17 Jan 2017 19:16:37 -0800

    [ 
https://issues.apache.org/jira/browse/FLINK-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15827361#comment-15827361
 ]


Zhijiang Wang commented on FLINK-5501:
--------------------------------------

Thank you for the quick response!

Yes, you have already considered all the feasible alternatives for implementing 
this goal, and I totally agree with them.

1. For extending the leader election service, I also thought of this approach 
before implementing. In the current {{ZookeeperLeaderElectionService}}, the 
leader node is of EPHEMERAL type. If the incrementing number is carried in this 
node, the node would have to be changed to PERSISTENT type; otherwise another 
node must be added for the incrementing number. This approach is very similar 
to the one based on {{RunningJobsRegistry}}. From a semantic point of view, 
{{LeaderElectionService}} may be more suitable, but to keep the change minimal, 
I have already implemented it with {{RunningJobsRegistry}}.
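
To illustrate why the incrementing number cannot live in an EPHEMERAL node, here 
is a rough sketch. It is only a toy model under stated assumptions: 
{{ToyCoordinationStore}}, {{LeaderEpochCounter}} and the path {{/leader-epoch}} 
are hypothetical names made up for this example, the store is a plain in-memory 
map with a single session, and none of it is the ZooKeeper or Flink API.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a store with ephemeral and persistent nodes (NOT the ZooKeeper API). */
final class ToyCoordinationStore {
    private final Map<String, Long> persistentNodes = new HashMap<>();
    private final Map<String, Long> ephemeralNodes = new HashMap<>();

    void putPersistent(String path, long value) {
        persistentNodes.put(path, value);
    }

    void putEphemeral(String path, long value) {
        ephemeralNodes.put(path, value);
    }

    /** Returns the node value, or null if the node does not exist. */
    Long get(String path) {
        Long value = ephemeralNodes.get(path);
        return value != null ? value : persistentNodes.get(path);
    }

    /** Simulates the end of the owning session: all ephemeral nodes vanish. */
    void expireSession() {
        ephemeralNodes.clear();
    }
}

/** Hypothetical epoch counter kept in a persistent node next to the leader node. */
final class LeaderEpochCounter {
    static final String EPOCH_PATH = "/leader-epoch"; // assumed path, illustration only

    /** Increments and returns the epoch; a value above 1 means an earlier leader existed. */
    static long nextEpoch(ToyCoordinationStore store) {
        Long previous = store.get(EPOCH_PATH);
        long next = (previous == null) ? 1L : previous + 1L;
        store.putPersistent(EPOCH_PATH, next);
        return next;
    }
}
```

In this model a number written with {{putEphemeral}} is gone after 
{{expireSession()}}, while the counter in the persistent node survives, so the 
next leader can tell that it is not the first one.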

2. Actually I had not thought of this approach before; it is a totally 
different and interesting idea. The {{TaskManager}} is aware of a 
{{JobManager}} leader change and will re-register with the new leader after the 
change. So the {{JobManager}} can rely on the registration process to determine 
the status. But it may be complicated to coordinate between normal scheduling 
and reconciling, because they will be triggered at the same time. It would also 
waste more resources temporarily. If the {{JobManager}} can determine the 
status after startup in an easy way, it can run the specific process directly, 
without any ambiguity.

In summary, I prefer the first way to implement the goal. The whole 
{{JobManager}} failure feature has been finished on my side. Could I submit the 
pull request for this issue based on the {{RunningJobsRegistry}} implementation?

> Determine whether the job starts from last JobManager failure
> -------------------------------------------------------------
>
>                 Key: FLINK-5501
>                 URL: https://issues.apache.org/jira/browse/FLINK-5501
>             Project: Flink
>          Issue Type: Sub-task
>          Components: JobManager
>            Reporter: Zhijiang Wang
>            Assignee: Zhijiang Wang
>
> When the {{JobManagerRunner}} is granted leadership, it should check whether 
> the current job is already running. If the job is running, the 
> {{JobManager}} should reconcile itself (enter the RECONCILING state) and wait 
> for the {{TaskManager}} to report task status. Otherwise the {{JobManager}} 
> can schedule the {{ExecutionGraph}} in the usual way.
> The {{RunningJobsRegistry}} can provide a way to check the job's running 
> status, but we should extend the current interface and fix the related 
> process to support this function.
> 1. {{RunningJobsRegistry}} sets the RUNNING status after the 
> {{JobManagerRunner}} is granted leadership for the first time.
> 2. If the job finishes, the job status will be set to FINISHED by 
> {{RunningJobsRegistry}} and the status will be deleted before exit. 
> 3. If the mini cluster starts multiple {{JobManagerRunner}}s, and the leader 
> {{JobManagerRunner}} has already finished the job and set the job status to 
> FINISHED, the other {{JobManagerRunner}}s will exit after being granted 
> leadership.
> 4. If the {{JobManager}} fails, the job status will still be RUNNING. So 
> if a {{JobManagerRunner}} (the previous or a new one) is granted leadership 
> again, it will check the job status and enter the {{RECONCILING}} state.
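
The four rules in the issue description could be sketched roughly as follows. 
This is only a minimal sketch under assumptions: the enum values, 
{{InMemoryJobsRegistry}} and {{LeadershipDecision}} are hypothetical names 
chosen for illustration, and an in-memory map stands in for the highly 
available store. It is not the actual {{RunningJobsRegistry}} interface.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical status values; PENDING means no entry exists for the job. */
enum JobSchedulingStatus { PENDING, RUNNING, FINISHED }

/** What a runner does once it is granted leadership. */
enum LeaderAction { SCHEDULE, RECONCILE, EXIT }

/** In-memory stand-in for a highly available registry (illustration only). */
final class InMemoryJobsRegistry {
    private final Map<String, JobSchedulingStatus> statuses = new ConcurrentHashMap<>();

    JobSchedulingStatus getStatus(String jobId) {
        return statuses.getOrDefault(jobId, JobSchedulingStatus.PENDING);
    }

    void setRunning(String jobId) {
        statuses.put(jobId, JobSchedulingStatus.RUNNING);
    }

    void setFinished(String jobId) {
        statuses.put(jobId, JobSchedulingStatus.FINISHED);
    }

    /** Rule 2: the entry is deleted before the cluster exits. */
    void clear(String jobId) {
        statuses.remove(jobId);
    }
}

final class LeadershipDecision {
    /** Decides the action on leadership grant based on the registered job status. */
    static LeaderAction onGrantLeadership(InMemoryJobsRegistry registry, String jobId) {
        switch (registry.getStatus(jobId)) {
            case RUNNING:
                // Rule 4: a previous leader failed while the job was running.
                return LeaderAction.RECONCILE;
            case FINISHED:
                // Rule 3: another runner already finished the job.
                return LeaderAction.EXIT;
            default:
                // Rule 1: first grant, so mark the job RUNNING and schedule normally.
                registry.setRunning(jobId);
                return LeaderAction.SCHEDULE;
        }
    }
}
```

In a real highly available setup the check and the update in the default 
branch would need to be safe against concurrent runners; the sketch leaves 
that out and relies on only the current leader calling the method.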



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
