[ https://issues.apache.org/jira/browse/FLINK-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609915#comment-15609915 ]

ASF GitHub Bot commented on FLINK-3347:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2697

    [backport] [FLINK-3347] [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down

    This is a back port for release 1.1.

    The QuarantineMonitor subscribes to the actor system's event bus and listens to
    AssociationErrorEvents. These are the events which are generated when the actor system
    has quarantined another actor system or if it has been quarantined by another actor
    system. In case of the quarantined state, the actor system will be shutdown killing all
    actors and then the JVM is terminated.

    Per default the `QuarantineMonitor` is not started for the `TaskManagers`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink backportQuarantineMonitor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2697.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2697

----
commit 7d89c05aa335046b29f5e263a264dc1d3cdcd890
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2016-10-26T22:24:12Z

    [FLINK-3347] [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down

    The QuarantineMonitor subscribes to the actor system's event bus and listens to
    AssociationErrorEvents. These are the events which are generated when the actor system
    has quarantined another actor system or if it has been quarantined by another actor
    system. In case of the quarantined state, the actor system will be shutdown killing all
    actors and then the JVM is terminated.

commit 15c88f636fe2652eb937e0768dd379c058636199
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2016-10-26T22:39:17Z

    Add configuration switch to enable the quarantine monitor for TaskManagers

    Per default the QuarantineMonitor is disabled for TaskManagers in order to not change
    the behaviour of 1.1.

----

> TaskManager (or its ActorSystem) need to restart in case they notice quarantine
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-3347
>                 URL: https://issues.apache.org/jira/browse/FLINK-3347
>             Project: Flink
>          Issue Type: Improvement
>          Components: TaskManager
>    Affects Versions: 0.10.1
>            Reporter: Stephan Ewen
>            Assignee: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.0.0, 1.2.0, 1.1.4
>
>
> There are cases where Akka quarantines remote actor systems. In that case, no
> further communication is possible with that actor system unless one of the
> two actor systems is restarted.
> The result is that a TaskManager is up and available, but cannot register at
> the JobManager (Akka refuses connection because of the quarantined state),
> making the TaskManager a useless process.
> I suggest to let the TaskManager restart itself once it notices that either
> it quarantined the JobManager, or the JobManager quarantined it.
> It is possible to recognize that by listening to certain events in the actor
> system event stream:
> http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
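
The event-bus approach described in this message (subscribe to association errors, then shut down once a quarantine is detected in either direction) can be sketched roughly as follows. This is a minimal illustration in plain Java with no Akka dependency, and it is not the code from the pull request. `EventBus`, `AssociationError`, `QuarantineHandler`, `RecordingHandler`, and the two marker strings are hypothetical stand-ins. The handler only records the shutdown steps rather than stopping an actor system or terminating the JVM, so the sketch is safe to run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public final class QuarantineMonitorSketch {

    /** Hypothetical stand-in for an association error event published on the event bus. */
    static final class AssociationError {
        final String remoteSystem;
        final String cause;

        AssociationError(String remoteSystem, String cause) {
            this.remoteSystem = remoteSystem;
            this.cause = cause;
        }
    }

    /** Minimal synchronous event bus; the actor system's event bus plays this role in Flink. */
    static final class EventBus {
        private final List<Consumer<AssociationError>> subscribers = new ArrayList<>();

        void subscribe(Consumer<AssociationError> subscriber) {
            subscribers.add(subscriber);
        }

        void publish(AssociationError event) {
            for (Consumer<AssociationError> subscriber : subscribers) {
                subscriber.accept(event);
            }
        }
    }

    /** Callback for the two quarantine directions named in the pull request. */
    interface QuarantineHandler {
        void wasQuarantinedBy(String remoteSystem);
        void hasQuarantined(String remoteSystem);
    }

    /** Subscribes to the bus and forwards quarantine-related errors to the handler. */
    static final class QuarantineMonitor {
        // Assumed marker substrings for illustration; not taken from Akka's actual messages.
        static final String WAS_QUARANTINED = "quarantined this system";
        static final String HAS_QUARANTINED = "has been quarantined";

        private final QuarantineHandler handler;

        QuarantineMonitor(EventBus bus, QuarantineHandler handler) {
            this.handler = handler;
            bus.subscribe(this::onEvent);
        }

        private void onEvent(AssociationError event) {
            if (event.cause.contains(WAS_QUARANTINED)) {
                handler.wasQuarantinedBy(event.remoteSystem);
            } else if (event.cause.contains(HAS_QUARANTINED)) {
                handler.hasQuarantined(event.remoteSystem);
            }
            // Any other association error is ignored.
        }
    }

    /** Records the shutdown steps instead of performing them. */
    static final class RecordingHandler implements QuarantineHandler {
        final List<String> steps = new ArrayList<>();

        @Override
        public void wasQuarantinedBy(String remoteSystem) {
            shutDown("quarantined by " + remoteSystem);
        }

        @Override
        public void hasQuarantined(String remoteSystem) {
            shutDown("quarantined " + remoteSystem);
        }

        private void shutDown(String reason) {
            steps.add("shut down actor system (" + reason + ")");
            steps.add("terminate JVM"); // a real handler would call System.exit here
        }
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        RecordingHandler handler = new RecordingHandler();
        new QuarantineMonitor(bus, handler);

        // An unrelated error is ignored; the quarantine error triggers the handler.
        bus.publish(new AssociationError("jobmanager", "connection refused"));
        bus.publish(new AssociationError(
                "jobmanager", "The remote system has " + QuarantineMonitor.WAS_QUARANTINED));

        handler.steps.forEach(System.out::println);
    }
}
```

Keeping the reaction behind a handler interface separates detection from the shutdown policy, which is one way a monitor like this could stay disabled by default for TaskManagers and be switched on through configuration.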