[
https://issues.apache.org/jira/browse/FLINK-18959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188303#comment-17188303
]
Till Rohrmann commented on FLINK-18959:
---------------------------------------
Yes exactly. The scope of this ticket is to fix the current regression, which
means that cancelling a job should still trigger its archiving.
> Fail to archiveExecutionGraph because job is not finished when dispatcher
> close
> -------------------------------------------------------------------------------
>
> Key: FLINK-18959
> URL: https://issues.apache.org/jira/browse/FLINK-18959
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.10.0, 1.12.0, 1.11.1
> Reporter: Liu
> Assignee: Liu
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.12.0, 1.11.2, 1.10.3
>
> Attachments: flink-debug-log
>
>
> When a job is cancelled, we expect to see it in Flink's history server. But I
> cannot see my job after it is cancelled.
> After digging into the problem, I found that the function
> archiveExecutionGraph is not executed. Below is a brief log:
> {panel:title=log}
> 2020-08-14 15:10:06,406 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph
> [flink-akka.actor.default-dispatcher-15] - Job EtlAndWindow
> (6f784d4cc5bae88a332d254b21660372) switched from state RUNNING to CANCELLING.
> 2020-08-14 15:10:06,415 DEBUG
> org.apache.flink.runtime.dispatcher.MiniDispatcher
> [flink-akka.actor.default-dispatcher-3] - Shutting down per-job cluster
> because the job was canceled.
> 2020-08-14 15:10:06,629 INFO
> org.apache.flink.runtime.dispatcher.MiniDispatcher
> [flink-akka.actor.default-dispatcher-3] - Stopping dispatcher
> akka.tcp://[email protected]:38663/user/dispatcher.
> 2020-08-14 15:10:06,629 INFO
> org.apache.flink.runtime.dispatcher.MiniDispatcher
> [flink-akka.actor.default-dispatcher-3] - Stopping all currently running jobs
> of dispatcher akka.tcp://[email protected]:38663/user/dispatcher.
> 2020-08-14 15:10:06,631 INFO org.apache.flink.runtime.jobmaster.JobMaster
> [flink-akka.actor.default-dispatcher-29] - Stopping the JobMaster for job
> EtlAndWindow(6f784d4cc5bae88a332d254b21660372).
> 2020-08-14 15:10:06,632 DEBUG org.apache.flink.runtime.jobmaster.JobMaster
> [flink-akka.actor.default-dispatcher-29] - Disconnect TaskExecutor
> container_e144_1590060720089_2161_01_000006 because: Stopping JobMaster for
> job EtlAndWindow(6f784d4cc5bae88a332d254b21660372).
> 2020-08-14 15:10:06,646 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph
> [flink-akka.actor.default-dispatcher-29] - Job EtlAndWindow
> (6f784d4cc5bae88a332d254b21660372) switched from state CANCELLING to CANCELED.
> 2020-08-14 15:10:06,664 DEBUG
> org.apache.flink.runtime.dispatcher.MiniDispatcher
> [flink-akka.actor.default-dispatcher-4] - There is a newer JobManagerRunner
> for the job 6f784d4cc5bae88a332d254b21660372.
> {panel}
> From the log, we can see that the job is not finished when the dispatcher
> closes. The process is as follows:
> * Receive the cancel command and send it to all tasks asynchronously.
> * In MiniDispatcher, begin shutting down the per-job cluster.
> * Stop the dispatcher and remove the job.
> * The job is cancelled and the callback is executed in method
> startJobManagerRunner.
> * Because the job was removed earlier, currentJobManagerRunner is null, which
> is not equal to the original jobManagerRunner. In this case, the
> archivedExecutionGraph will not be uploaded.
> In normal cases, I find that the job is cancelled first and then the
> dispatcher is stopped, so archiving the ExecutionGraph succeeds. But I think
> that the order is not constrained and it is hard to know which comes first.
> The above is what I suspect. If so, then we should fix it.
>
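For illustration, the ordering problem described in the quoted report can be
reproduced with a toy model. This is only a sketch under stated assumptions,
not Flink's actual Dispatcher code: ArchiveRaceSketch, ToyDispatcher, Runner,
and run are hypothetical names, and a boolean flag stands in for uploading the
archived execution graph. The only behaviour taken from the report is that the
completion callback archives the job only when the runner it was registered
for is still the one tracked by the dispatcher.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Toy model of the reported race; all names here are hypothetical, not Flink APIs. */
public class ArchiveRaceSketch {

    /** Hypothetical stand-in for a JobManagerRunner: it only exposes a result future. */
    static final class Runner {
        final CompletableFuture<String> resultFuture = new CompletableFuture<>();
    }

    /** Hypothetical stand-in for the dispatcher's job bookkeeping. */
    static final class ToyDispatcher {
        final Map<String, Runner> runners = new HashMap<>();
        boolean archived = false; // stands in for uploading the archived execution graph

        Runner startJobManagerRunner(String jobId) {
            Runner runner = new Runner();
            runners.put(jobId, runner);
            // Completion callback: archive only if this runner is still registered.
            runner.resultFuture.thenAccept(finalState -> {
                Runner current = runners.get(jobId);
                if (current == runner) {
                    archived = true;
                }
                // Otherwise the result is dropped, which is the reported symptom.
            });
            return runner;
        }

        /** Models "Stopping all currently running jobs": the job is removed. */
        void stop() {
            runners.clear();
        }
    }

    /** Returns whether the job was archived for the given ordering of events. */
    static boolean run(boolean dispatcherStopsFirst) {
        ToyDispatcher dispatcher = new ToyDispatcher();
        Runner runner = dispatcher.startJobManagerRunner("job-1");
        if (dispatcherStopsFirst) {
            dispatcher.stop();                        // job removed before it reaches CANCELED
            runner.resultFuture.complete("CANCELED"); // callback finds no current runner
        } else {
            runner.resultFuture.complete("CANCELED"); // callback finds the original runner
            dispatcher.stop();
        }
        return dispatcher.archived;
    }

    public static void main(String[] args) {
        System.out.println("job finishes first:     archived=" + run(false));
        System.out.println("dispatcher stops first: archived=" + run(true));
    }
}
```

In this model, the job is archived only when it reaches its terminal state
before the dispatcher removes it, which matches the two orderings described in
the report.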