Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1669#discussion_r53531860

--- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
@@ -1487,7 +1487,7 @@ class JobManager(
         }
       }

-      eg.fail(cause)
+      eg.cancel()
--- End diff --

Good point with the TM logs. My main reason was that calls to `fail` (for example during `cancelAndClearEverything` on shutdown, or when the `InstanceManager` is shut down) can lead to the execution graph being restarted even though the job manager is shutting down. The cancel call ensures that this does not happen and that the execution graph eventually enters a terminal state.

What actually triggered this change: if you start a test cluster and shut it down while a job with a restart strategy is running, *don't* immediately kill the process, and have logging enabled, you can see that the `ExecutionGraph` is still attempting to recover the job.

What I don't understand is how this even happens when we shut down the `ExecutorService`. Any idea? Do you think there is another way to prevent this behaviour? I would be happy to keep the failure cause as before, but couldn't think of any other way.

---

This has been changed as well: a `fail` will be ignored when the job is `cancelling` or `cancelled`. That's OK, right?
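To make the intended behaviour concrete, here is a minimal sketch of the state guard described above. This is not the actual `ExecutionGraph` code; the class name, the simplified `JobStatus` values, and the single state field are assumptions for illustration only. It shows the two points being discussed: `cancel` always drives the graph towards a terminal state regardless of the restart strategy, and `fail` is ignored once the job is cancelling or cancelled.

```scala
// Hypothetical sketch, not the real Flink ExecutionGraph.
object JobStatus extends Enumeration {
  val RUNNING, CANCELLING, CANCELED, FAILED = Value
}

class SketchExecutionGraph {
  @volatile private var state: JobStatus.Value = JobStatus.RUNNING

  def cancel(): Unit = {
    // Cancelling always moves the graph towards a terminal state,
    // independent of any configured restart strategy.
    state = JobStatus.CANCELLING
    // ... once all tasks have shut down:
    state = JobStatus.CANCELED
  }

  def fail(cause: Throwable): Unit = {
    // The change described above: a fail is ignored while the job is being
    // cancelled or is already cancelled, so a restart strategy cannot
    // resurrect the job while the cluster is shutting down.
    if (state != JobStatus.CANCELLING && state != JobStatus.CANCELED) {
      state = JobStatus.FAILED
      // ... this is where a restart strategy could otherwise kick in
      // and attempt to recover the job.
    }
  }
}
```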