Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1669#discussion_r53531860
  
    --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
    @@ -1487,7 +1487,7 @@ class JobManager(
               }
             }
     
    -        eg.fail(cause)
    +        eg.cancel()
    --- End diff ---
    
    Good point with the TM logs.
    
    My main reason was that calls to `fail` (for example from `cancelAndClearEverything` during shutdown, or when the `InstanceManager` shuts down) can lead to the execution graph being restarted even though the job manager is already shut down. The cancel call ensures that this does not happen and that the execution graph eventually reaches a terminal state.
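
    To make the distinction concrete, here is a simplified sketch (only for illustration, not the actual `ExecutionGraph` code; names and state handling are reduced to the bare minimum):

```scala
// Simplified sketch, not the actual ExecutionGraph code: with a restart
// strategy configured, fail() may send the graph back into RESTARTING,
// whereas cancel() always drives it towards a terminal state.
object JobStatus extends Enumeration {
  val RUNNING, RESTARTING, CANCELLING, CANCELED, FAILED = Value
}

class SketchedExecutionGraph(restartAllowed: Boolean) {
  import JobStatus._

  @volatile var state: JobStatus.Value = RUNNING

  def fail(cause: Throwable): Unit = state match {
    case RUNNING | RESTARTING if restartAllowed =>
      // the restart strategy kicks in: the job keeps trying to recover,
      // even if the JobManager is shutting down
      state = RESTARTING
    case RUNNING | RESTARTING =>
      state = FAILED
    case _ => // already terminal
  }

  def cancel(): Unit = state match {
    case RUNNING | RESTARTING =>
      state = CANCELLING
      state = CANCELED // terminal, independent of the restart strategy
    case _ => // already cancelling or terminal
  }
}
```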
    
    What triggered this change was the following: if you start a test cluster, shut it down while a job with a restart strategy is running, *don't* immediately kill the process, and have logging enabled, you can see that the `ExecutionGraph` is still attempting to recover the job.
    
    What I don't understand is how this even happens when we shut down the 
`ExecutorService`. Any idea?
    
    Do you think there is another way to prevent this behaviour? I would be 
happy to keep the failure cause as before, but couldn't think of any other way.
    
    ---
    
    This has been changed as well: a `fail` will be ignored when the job is 
`cancelling` or `cancelled`. That's OK, right?
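
    In sketch form, using the hypothetical `JobStatus` from above (again only for illustration, not the actual code), the change boils down to ignoring `fail` in the cancellation states:

```scala
// Illustration only: a fail() that arrives while the job is cancelling or
// already cancelled no longer changes the state.
def failUnlessCancelling(state: JobStatus.Value): JobStatus.Value = state match {
  case JobStatus.CANCELLING | JobStatus.CANCELED => state            // fail is ignored
  case _                                         => JobStatus.FAILED // regular failure handling
}
```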

