[jira] [Commented] (FLINK-3443) JobManager cancel and clear everything fails jobs instead of cancelling

ASF GitHub Bot (JIRA) Fri, 19 Feb 2016 15:10:22 -0800

    [ https://issues.apache.org/jira/browse/FLINK-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155099#comment-15155099 ]


ASF GitHub Bot commented on FLINK-3443:
---------------------------------------

Github user uce commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1669#discussion_r53531860
  
    --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
    @@ -1487,7 +1487,7 @@ class JobManager(
               }
             }
     
    -        eg.fail(cause)
    +        eg.cancel()
    --- End diff --
    
    Good point with the TM logs.
    
    My main reason was that calls to `fail` (for example from 
`cancelAndClearEverything` during shutdown, or from a shutdown of the 
`InstanceManager`) can lead to the execution graph being restarted even though 
the job manager is shut down. The `cancel` call ensures that this does not 
happen and that the execution graph eventually enters a terminal state.
    
    The main thing that triggered this change was the following: if you start 
a test cluster and shut it down while a job with a restart strategy is running, 
and you *don't* immediately kill the process and have logging enabled, you can 
see in the logs that the `ExecutionGraph` is still attempting to recover the job.
    
    What I don't understand is how this even happens when we shut down the 
`ExecutorService`. Any idea?
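
On the `ExecutorService` question, one thing that may be worth checking is plain `java.util.concurrent` behaviour: `shutdown()` only stops accepting new tasks and still runs everything that was already submitted, whereas `shutdownNow()` interrupts running tasks and returns the queued ones without running them. The sketch below uses only JDK classes. The class and method names are made up for illustration, and whether this is what actually happens on the Flink code path in question is an assumption that has not been verified here.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Illustrative only: plain JDK executors, no Flink classes involved.
final class ShutdownSketch {

    // True if a task that was queued before shutdown() still ran to completion.
    static boolean queuedTaskRunsAfterShutdown() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        // Occupies the single worker thread until released.
        executor.submit(() -> { release.await(); return null; });
        // Stands in for a pending "restart the job" action waiting in the queue.
        Future<String> restart = executor.submit(() -> "restarted");

        executor.shutdown(); // rejects new tasks, keeps the already submitted ones
        release.countDown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return restart.isDone() && !restart.isCancelled();
    }

    // Number of queued tasks that shutdownNow() hands back without running them.
    static int tasksDroppedByShutdownNow() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        // Blocks until interrupted; shutdownNow() delivers that interrupt.
        executor.submit(() -> { started.countDown(); new CountDownLatch(1).await(); return null; });
        executor.submit(() -> "restarted");
        try {
            started.await(); // make sure the first task is running, not queued
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        List<Runnable> neverStarted = executor.shutdownNow();
        return neverStarted.size();
    }

    public static void main(String[] args) {
        System.out.println("queued task ran after shutdown(): " + queuedTaskRunsAfterShutdown());
        System.out.println("tasks dropped by shutdownNow(): " + tasksDroppedByShutdownNow());
    }
}
```

If the restart is scheduled before the executor is shut down with `shutdown()`, it would still run under these semantics, which would match the observed recovery attempts.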
    
    Do you think there is another way to prevent this behaviour? I would be 
happy to keep the failure cause as before, but couldn't think of any other way.
    
    ---
    
    This has been changed as well: a `fail` will be ignored when the job is 
`cancelling` or `cancelled`. That's OK, right?
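
To make the intended behaviour concrete, here is a minimal sketch of the state transitions described above. It is a hypothetical, heavily simplified model and not Flink's `ExecutionGraph`: the class name, the state names, and the `allTasksStopped` hook are assumptions made for illustration only.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of the behaviour discussed above.
// This is NOT Flink's ExecutionGraph; all names are illustrative.
final class GraphStateSketch {
    enum State { RUNNING, RESTARTING, FAILED, CANCELLING, CANCELLED }

    private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
    private final boolean hasRestartStrategy;

    GraphStateSketch(boolean hasRestartStrategy) {
        this.hasRestartStrategy = hasRestartStrategy;
    }

    State current() {
        return state.get();
    }

    // Only a RUNNING graph reacts to a failure. In every other state,
    // including CANCELLING and CANCELLED, the call is ignored and returns false.
    // The cause is accepted to mirror eg.fail(cause) but not stored in this sketch.
    boolean fail(Throwable cause) {
        State next = hasRestartStrategy ? State.RESTARTING : State.FAILED;
        return state.compareAndSet(State.RUNNING, next);
    }

    // Cancelling never leads to a restart: RUNNING and RESTARTING both move
    // to CANCELLING, any other state is left unchanged.
    void cancel() {
        state.updateAndGet(s ->
            (s == State.RUNNING || s == State.RESTARTING) ? State.CANCELLING : s);
    }

    // Stand-in for "all tasks have stopped": completes the cancellation.
    void allTasksStopped() {
        state.compareAndSet(State.CANCELLING, State.CANCELLED);
    }

    public static void main(String[] args) {
        // Old behaviour on shutdown: fail() on a job with a restart strategy.
        GraphStateSketch failedOnShutdown = new GraphStateSketch(true);
        failedOnShutdown.fail(new RuntimeException("JobManager shut down"));
        System.out.println("after fail():   " + failedOnShutdown.current());

        // New behaviour on shutdown: cancel(), and a late fail() is ignored.
        GraphStateSketch cancelledOnShutdown = new GraphStateSketch(true);
        cancelledOnShutdown.cancel();
        boolean accepted = cancelledOnShutdown.fail(new RuntimeException("late failure"));
        cancelledOnShutdown.allTasksStopped();
        System.out.println("after cancel(): " + cancelledOnShutdown.current()
            + " (late fail accepted: " + accepted + ")");
    }
}
```

In this model the first graph ends up in a restarting state, while the second one reaches a terminal state and drops the late failure, which is the difference the diff is after.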


> JobManager cancel and clear everything fails jobs instead of cancelling
> -----------------------------------------------------------------------
>
>                 Key: FLINK-3443
>                 URL: https://issues.apache.org/jira/browse/FLINK-3443
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>
> When the job manager is shut down, it calls {{cancelAndClearEverything}}. 
> This method does not {{cancel}} the {{ExecutionGraph}} instances, but 
> {{fail}}s them, which can lead to {{ExecutionGraph}} restart.
> I've noticed this in tests, where an old graph got into a loop of restarts.
> What I don't understand is why the futures etc. are not cancelled when the 
> executor service is shut down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
