[ https://issues.apache.org/jira/browse/FLINK-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155982#comment-15155982 ]
ASF GitHub Bot commented on FLINK-3443:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1669#discussion_r53564976

    --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
    @@ -1487,7 +1487,7 @@ class JobManager(
               }
             }

    -        eg.fail(cause)
    +        eg.cancel()
    --- End diff --

    What if we let the job fail with an UnrecoverableException upon
    JobManager termination?

    On Feb 20, 2016 12:08 AM, "Ufuk Celebi" <notificati...@github.com> wrote:

    > In
    > flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
    > <https://github.com/apache/flink/pull/1669#discussion_r53531860>:
    >
    > > @@ -1487,7 +1487,7 @@ class JobManager(
    > >            }
    > >          }
    > >
    > > -        eg.fail(cause)
    > > +        eg.cancel()
    >
    > Good point with the TM logs.
    >
    > My main reason was that calls to fail (for example a shutdown
    > cancelAndClearEverything or shutdown of the InstanceManager) can lead to
    > the execution graph being restarted even though the job manager is shut
    > down. The cancel call ensures that this does not happen and the execution
    > graph eventually enters a terminal state.
    >
    > The main thing that triggered this change was the following: when you
    > start a test cluster and shut it down while a job with a restart strategy
    > is running and you *don't* immediately kill the process and have logging
    > enabled, you see that the ExecutionGraph is still attempting to recover
    > the job.
    >
    > What I don't understand is how this even happens when we shut down the
    > ExecutorService. Any idea?
    >
    > Do you think there is another way to prevent this behaviour? I would be
    > happy to keep the failure cause as before, but couldn't think of any other
    > way.
    > ------------------------------
    >
    > This has been changed as well: a fail will be ignored when the job is
    > cancelling or cancelled. That's OK, right?
    >
    > --
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/flink/pull/1669/files#r53531860>.
    >


> JobManager cancel and clear everything fails jobs instead of cancelling
> -----------------------------------------------------------------------
>
>                 Key: FLINK-3443
>                 URL: https://issues.apache.org/jira/browse/FLINK-3443
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>
> When the job manager is shut down, it calls {{cancelAndClearEverything}}.
> This method does not {{cancel}} the {{ExecutionGraph}} instances, but
> {{fail}}s them, which can lead to {{ExecutionGraph}} restart.
> I've noticed this in tests, where old graph got into a loop of restarts.
> What I don't understand is why the futures etc. are not cancelled when the
> executor service is shut down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
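
For readers following the thread, the difference between failing and cancelling on shutdown can be illustrated with a toy state machine. This is only a sketch under stated assumptions, not Flink's ExecutionGraph: the class `ShutdownSketch`, the `JobState` enum, the `hasRestartStrategy` flag, and the `allTasksTerminated()` callback are all hypothetical names made up for illustration. It models two points from the discussion: a failed job with a restart strategy moves towards a restart, while a cancelled job ends in a terminal state, and a `fail` that arrives while the job is cancelling or cancelled is ignored.

```java
/** Toy model of the behaviour discussed in this thread; NOT Flink's ExecutionGraph. */
class ShutdownSketch {

    /** Hypothetical job states, named only for this illustration. */
    enum JobState { RUNNING, FAILING, RESTARTING, CANCELLING, CANCELED, FAILED }

    // Assumption: a single flag stands in for "the job has a restart strategy".
    private final boolean hasRestartStrategy;
    private JobState state = JobState.RUNNING;

    ShutdownSketch(boolean hasRestartStrategy) {
        this.hasRestartStrategy = hasRestartStrategy;
    }

    JobState state() {
        return state;
    }

    /** Failing is only honoured while running or restarting; otherwise it is ignored. */
    void fail(Throwable cause) {
        if (state == JobState.RUNNING || state == JobState.RESTARTING) {
            state = JobState.FAILING;
        }
        // CANCELLING, CANCELED and FAILED: the failure is dropped on purpose.
    }

    /** Cancelling wins over a pending failure or restart. */
    void cancel() {
        if (state == JobState.RUNNING
                || state == JobState.FAILING
                || state == JobState.RESTARTING) {
            state = JobState.CANCELLING;
        }
    }

    /** Hypothetical callback: every task has reached a final state. */
    void allTasksTerminated() {
        if (state == JobState.FAILING) {
            // This is the path that keeps a job alive after shutdown.
            state = hasRestartStrategy ? JobState.RESTARTING : JobState.FAILED;
        } else if (state == JobState.CANCELLING) {
            state = JobState.CANCELED;
        }
    }

    public static void main(String[] args) {
        // Shutdown via fail(): the job heads for a restart.
        ShutdownSketch failed = new ShutdownSketch(true);
        failed.fail(new RuntimeException("JobManager is shutting down"));
        failed.allTasksTerminated();
        System.out.println(failed.state()); // prints RESTARTING

        // Shutdown via cancel(): a late fail() is ignored, the job terminates.
        ShutdownSketch cancelled = new ShutdownSketch(true);
        cancelled.cancel();
        cancelled.fail(new RuntimeException("late failure"));
        cancelled.allTasksTerminated();
        System.out.println(cancelled.state()); // prints CANCELED
    }
}
```

In this model the cost of cancelling is visible as well: the failure cause passed to `fail` is never recorded once the job is cancelling, which is the trade-off the thread raises about keeping the failure cause.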