[jira] [Commented] (FLINK-3443) JobManager cancel and clear everything fails jobs instead of cancelling

ASF GitHub Bot (JIRA) Sun, 21 Feb 2016 02:57:01 -0800

    [ https://issues.apache.org/jira/browse/FLINK-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155982#comment-15155982 ]


ASF GitHub Bot commented on FLINK-3443:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/1669#discussion_r53564976
  
    --- Diff: flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala ---
    @@ -1487,7 +1487,7 @@ class JobManager(
               }
             }
     
    -        eg.fail(cause)
    +        eg.cancel()
    --- End diff --
    
    What if we let the job fail with an UnrecoverableException upon JobManager
    termination?
    On Feb 20, 2016 12:08 AM, "Ufuk Celebi" <notificati...@github.com> wrote:
    
    > In flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
    > <https://github.com/apache/flink/pull/1669#discussion_r53531860>:
    >
    > > @@ -1487,7 +1487,7 @@ class JobManager(
    > >            }
    > >          }
    > >
    > > -        eg.fail(cause)
    > > +        eg.cancel()
    >
    > Good point with the TM logs.
    >
    > My main reason was that calls to fail (for example from
    > cancelAndClearEverything during shutdown, or from a shutdown of the
    > InstanceManager) can lead to the execution graph being restarted even
    > though the job manager is shut down. The cancel call ensures that this
    > does not happen and that the execution graph eventually enters a
    > terminal state.
    >
    > The main thing that triggered this change was the following: when you
    > start a test cluster and shut it down while a job with a restart strategy
    > is running and you *don't* immediately kill the process and have logging
    > enabled, you see that the ExecutionGraph is still attempting to recover
    > the job.
    >
    > What I don't understand is how this even happens when we shut down the
    > ExecutorService. Any idea?
    >
    > Do you think there is another way to prevent this behaviour? I would be
    > happy to keep the failure cause as before, but couldn't think of any other
    > way.
    > ------------------------------
    >
    > This has been changed as well: a fail will be ignored when the job is
    > cancelling or cancelled. That's OK, right?
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/flink/pull/1669/files#r53531860>.
    >



> JobManager cancel and clear everything fails jobs instead of cancelling
> -----------------------------------------------------------------------
>
>                 Key: FLINK-3443
>                 URL: https://issues.apache.org/jira/browse/FLINK-3443
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>            Reporter: Ufuk Celebi
>            Assignee: Ufuk Celebi
>
> When the job manager is shut down, it calls {{cancelAndClearEverything}}. 
> This method does not {{cancel}} the {{ExecutionGraph}} instances, but 
> {{fail}}s them, which can lead to {{ExecutionGraph}} restart.
> I've noticed this in tests, where an old graph got into a loop of restarts.
> What I don't understand is why the futures etc. are not cancelled when the 
> executor service is shut down.
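
The difference between the two calls can be illustrated with a toy state
machine. This is a minimal sketch and not Flink's ExecutionGraph: the class
ToyExecutionGraph, its Status enum, the restart counter, and
allTasksStopped() are hypothetical names made up for illustration. It assumes
a restart strategy with a fixed number of attempts, and it includes the
behaviour mentioned in the pull request discussion that a fail is ignored
while the job is cancelling or cancelled.

```java
/** Toy model only; Flink's real ExecutionGraph is far more involved. */
final class ToyExecutionGraph {

    /** Hypothetical, simplified set of job states. */
    enum Status { RUNNING, RESTARTING, FAILED, CANCELLING, CANCELED }

    private Status status = Status.RUNNING;
    private int restartAttemptsLeft; // assumed fixed-attempts restart strategy

    ToyExecutionGraph(int restartAttempts) {
        this.restartAttemptsLeft = restartAttempts;
    }

    /** A failure leads to a restart while attempts remain, unless the job is being cancelled. */
    void fail(Throwable cause) {
        if (status == Status.CANCELLING || status == Status.CANCELED || status == Status.FAILED) {
            return; // ignored: cancellation has started or the job is already terminal
        }
        if (restartAttemptsLeft > 0) {
            restartAttemptsLeft--;
            status = Status.RESTARTING; // a real graph would be rescheduled here
        } else {
            status = Status.FAILED;
        }
    }

    /** Cancelling never consults the restart strategy. */
    void cancel() {
        if (status == Status.CANCELED || status == Status.FAILED) {
            return; // already terminal
        }
        status = Status.CANCELLING;
    }

    /** Called once every task has stopped; completes a pending cancellation. */
    void allTasksStopped() {
        if (status == Status.CANCELLING) {
            status = Status.CANCELED;
        }
    }

    Status status() {
        return status;
    }

    public static void main(String[] args) {
        // Shutdown via fail(): the restart strategy kicks in.
        ToyExecutionGraph viaFail = new ToyExecutionGraph(3);
        viaFail.fail(new RuntimeException("JobManager shutdown"));
        System.out.println("after fail:   " + viaFail.status());

        // Shutdown via cancel(): later failures are ignored, the job ends up terminal.
        ToyExecutionGraph viaCancel = new ToyExecutionGraph(3);
        viaCancel.cancel();
        viaCancel.fail(new RuntimeException("task lost during shutdown"));
        viaCancel.allTasksStopped();
        System.out.println("after cancel: " + viaCancel.status());
    }
}
```

In this sketch the graph shut down through fail() ends in RESTARTING, which
corresponds to the restart loop described above, while the graph shut down
through cancel() reaches the terminal CANCELED state even if a failure
arrives in between.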



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
