[ 
https://issues.apache.org/jira/browse/FLINK-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114159#comment-16114159
 ] 

ASF GitHub Bot commented on FLINK-3347:
---------------------------------------

GitHub user NicoK opened a pull request:

    https://github.com/apache/flink/pull/4478

    [hotfix][docs] add documentation for `taskmanager.exit-on-fatal-akka-error`

    ## What is the purpose of the change
    
    When the quarantine monitor was added in FLINK-3347, the documentation for
    enabling it only went into the backports for the 1.2 and 1.1 branches, not
    into master and therefore not into the 1.3 release either. This PR adds it
    to master; it should also be applied to the `release-1.3` branch.
    
    ## Brief change log
    
    - add configuration documentation for `taskmanager.exit-on-fatal-akka-error`
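    
    For illustration, enabling the option would presumably be a single entry
    in `flink-conf.yaml`. This is a sketch, not text from the patch: the value
    `true` is chosen for the example, and the option is assumed to be disabled
    by default.
    
    ```yaml
    # Terminate the TaskManager process on a fatal Akka error
    # (such as a quarantine event) so that it can be restarted.
    taskmanager.exit-on-fatal-akka-error: true
    ```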
    
    ## Verifying this change
    
    This change is a trivial rework / code cleanup without any test coverage.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with
        `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its
        components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NicoK/flink hotfix_quarantine_monitor_config

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4478.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4478
    
----
commit 6111a0626e13b85b8996dcdf9f3d741c23739cf5
Author: Nico Kruber <n...@data-artisans.com>
Date:   2017-08-04T09:11:35Z

    [hotfix][docs] add documentation for `taskmanager.exit-on-fatal-akka-error`
    
    When the quarantine monitor was added as of FLINK-3347, this documentation
    for enabling it only went into the backport for the 1.2 branch, not into
    master and therefore not into the 1.3 release either. This adds it again.

----


> TaskManager (or its ActorSystem) need to restart in case they notice quarantine
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-3347
>                 URL: https://issues.apache.org/jira/browse/FLINK-3347
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 0.10.1
>            Reporter: Stephan Ewen
>            Assignee: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.0.0, 1.1.4, 1.3.0, 1.2.1
>
>
> There are cases where Akka quarantines remote actor systems. In that case, no 
> further communication is possible with that actor system unless one of the 
> two actor systems is restarted.
> The result is that a TaskManager is up and available, but cannot register at 
> the JobManager (Akka refuses connection because of the quarantined state), 
> making the TaskManager a useless process.
> I suggest letting the TaskManager restart itself once it notices that either 
> it quarantined the JobManager, or the JobManager quarantined it.
> The quarantined state can be recognized by listening to certain events in the 
> actor system event stream: 
> http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state
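
The idea in the issue description (subscribe to the event stream, react once a quarantine event shows up) can be sketched roughly as below. This is a toy model using only the Java standard library, not Akka's API and not the code from the actual fix: `EventStream`, `Quarantined`, `Heartbeat` and `QuarantineWatcher` are hypothetical names invented for this example, and the handler only records the event where a real TaskManager would exit or restart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of "listen on an event stream, react to quarantine". Not Akka's API. */
public class QuarantineWatchSketch {

    /** Hypothetical marker type standing in for remoting lifecycle events. */
    interface RemotingEvent {}

    /** Hypothetical event: this system and the remote one can no longer talk. */
    static final class Quarantined implements RemotingEvent {
        final String remoteAddress;

        Quarantined(String remoteAddress) {
            this.remoteAddress = remoteAddress;
        }
    }

    /** Hypothetical unrelated event that the watcher must ignore. */
    static final class Heartbeat implements RemotingEvent {}

    /** Minimal in-memory event stream: every subscriber sees every event. */
    static final class EventStream {
        private final List<Consumer<RemotingEvent>> subscribers = new ArrayList<>();

        void subscribe(Consumer<RemotingEvent> subscriber) {
            subscribers.add(subscriber);
        }

        void publish(RemotingEvent event) {
            for (Consumer<RemotingEvent> subscriber : subscribers) {
                subscriber.accept(event);
            }
        }
    }

    /** Subscribes to the stream and runs the handler once, on the first quarantine event. */
    static final class QuarantineWatcher {
        private boolean triggered = false;

        QuarantineWatcher(EventStream stream, Consumer<String> onQuarantine) {
            stream.subscribe(event -> {
                if (event instanceof Quarantined && !triggered) {
                    triggered = true;
                    onQuarantine.accept(((Quarantined) event).remoteAddress);
                }
            });
        }
    }

    public static void main(String[] args) {
        EventStream stream = new EventStream();
        List<String> exits = new ArrayList<>();

        // A real TaskManager would terminate the process here; the sketch only records it.
        new QuarantineWatcher(stream, remote -> exits.add(remote));

        stream.publish(new Heartbeat());
        stream.publish(new Quarantined("akka.tcp://flink@jobmanager:6123"));
        System.out.println(exits);
    }
}
```

Passing the reaction in as a callback keeps the detection logic separate from the decision of what to do, which is where a switch such as `taskmanager.exit-on-fatal-akka-error` would plausibly come in.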


