[ https://issues.apache.org/jira/browse/FLINK-18959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann updated FLINK-18959:
----------------------------------
    Fix Version/s: 1.10.3
                   1.11.2
                   1.12.0

> Fail to archiveExecutionGraph because job is not finished when dispatcher close
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-18959
>                 URL: https://issues.apache.org/jira/browse/FLINK-18959
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0, 1.12.0, 1.11.1
>            Reporter: Liu
>            Priority: Minor
>             Fix For: 1.12.0, 1.11.2, 1.10.3
>
>         Attachments: flink-debug-log
>
>
> When a job is cancelled, we expect to see it in Flink's history server. But I
> cannot see my job after it is cancelled.
> After digging into the problem, I found that the function
> archiveExecutionGraph is not executed. Below is a brief log:
> {panel:title=log}
> 2020-08-14 15:10:06,406 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [flink-akka.actor.default-dispatcher-15] - Job EtlAndWindow (6f784d4cc5bae88a332d254b21660372) switched from state RUNNING to CANCELLING.
> 2020-08-14 15:10:06,415 DEBUG org.apache.flink.runtime.dispatcher.MiniDispatcher [flink-akka.actor.default-dispatcher-3] - Shutting down per-job cluster because the job was canceled.
> 2020-08-14 15:10:06,629 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher [flink-akka.actor.default-dispatcher-3] - Stopping dispatcher akka.tcp://flink@bjfk-c9865.yz02:38663/user/dispatcher.
> 2020-08-14 15:10:06,629 INFO org.apache.flink.runtime.dispatcher.MiniDispatcher [flink-akka.actor.default-dispatcher-3] - Stopping all currently running jobs of dispatcher akka.tcp://flink@bjfk-c9865.yz02:38663/user/dispatcher.
> 2020-08-14 15:10:06,631 INFO org.apache.flink.runtime.jobmaster.JobMaster [flink-akka.actor.default-dispatcher-29] - Stopping the JobMaster for job EtlAndWindow(6f784d4cc5bae88a332d254b21660372).
> 2020-08-14 15:10:06,632 DEBUG org.apache.flink.runtime.jobmaster.JobMaster [flink-akka.actor.default-dispatcher-29] - Disconnect TaskExecutor container_e144_1590060720089_2161_01_000006 because: Stopping JobMaster for job EtlAndWindow(6f784d4cc5bae88a332d254b21660372).
> 2020-08-14 15:10:06,646 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [flink-akka.actor.default-dispatcher-29] - Job EtlAndWindow (6f784d4cc5bae88a332d254b21660372) switched from state CANCELLING to CANCELED.
> 2020-08-14 15:10:06,664 DEBUG org.apache.flink.runtime.dispatcher.MiniDispatcher [flink-akka.actor.default-dispatcher-4] - There is a newer JobManagerRunner for the job 6f784d4cc5bae88a332d254b21660372.
> {panel}
> From the log, we can see that the job is not yet finished when the dispatcher
> closes. The sequence is as follows:
> * The cancel command is received and sent to all tasks asynchronously.
> * MiniDispatcher begins shutting down the per-job cluster.
> * The dispatcher is stopped and the job is removed.
> * The job reaches CANCELED and the callback registered in method
>   startJobManagerRunner is executed.
> * Because the job was already removed, currentJobManagerRunner is null, which
>   is not equal to the original jobManagerRunner. In this case, the
>   archivedExecutionGraph is not uploaded.
> In normal cases, I find that the job is cancelled first and the dispatcher is
> stopped afterwards, so archiving the archivedExecutionGraph succeeds. But I
> think the order is not constrained, and it is hard to know which comes first.
> The above is what I suspect. If it is correct, then we should fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
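
The ordering problem described in the report can be illustrated with a minimal, single-threaded sketch. This is not Flink's actual code: the class ArchiveRaceSketch, its Runner type, the job id "job-1", and the archived flag are all hypothetical stand-ins, and the sketch assumes only what the report states, namely that the completion callback archives the job only when the runner it was registered for is still the one the dispatcher tracks.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Hypothetical, simplified model of the dispatcher bookkeeping; not Flink's real classes. */
final class ArchiveRaceSketch {

    /** Stand-in for a JobManagerRunner; only its result future matters here. */
    static final class Runner {
        final CompletableFuture<String> resultFuture = new CompletableFuture<>();
    }

    private final Map<String, Runner> runners = new HashMap<>();

    /** Stands in for "archiveExecutionGraph was called". */
    boolean archived = false;

    /** Registers the runner and attaches the completion callback (assumed shape). */
    void startRunner(String jobId, Runner runner) {
        runners.put(jobId, runner);
        runner.resultFuture.thenAccept(finalState -> {
            Runner current = runners.get(jobId); // null once the job has been removed
            if (current == runner) {
                archived = true;
            }
            // Otherwise the job is treated as stale and nothing is archived.
        });
    }

    /** Models the dispatcher removing the job while shutting down. */
    void removeJob(String jobId) {
        runners.remove(jobId);
    }

    /** Replays one of the two orderings and reports whether archiving happened. */
    static boolean archivedWhen(boolean dispatcherStopsFirst) {
        ArchiveRaceSketch dispatcher = new ArchiveRaceSketch();
        Runner runner = new Runner();
        dispatcher.startRunner("job-1", runner);
        if (dispatcherStopsFirst) {
            dispatcher.removeJob("job-1");            // dispatcher closes first
            runner.resultFuture.complete("CANCELED"); // job finishes afterwards
        } else {
            runner.resultFuture.complete("CANCELED"); // job finishes first
            dispatcher.removeJob("job-1");            // dispatcher closes afterwards
        }
        return dispatcher.archived;
    }
}
```

Under these assumptions, archivedWhen(false) models the normal case where the job is archived, while archivedWhen(true) models the ordering seen in the log, where the callback finds no matching runner and skips archiving.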