Till Rohrmann created FLINK-3345:
------------------------------------

             Summary: Restart TaskManager in case of an Akka quarantine event
                 Key: FLINK-3345
                 URL: https://issues.apache.org/jira/browse/FLINK-3345
             Project: Flink
          Issue Type: Improvement
          Components: Distributed Runtime
    Affects Versions: 1.0.0
            Reporter: Till Rohrmann


An {{ActorSystem}} that gets quarantined (death watch trigger, system message 
delivery failure) cannot reconnect to the quarantining {{ActorSystem}}. To 
reconnect, the quarantined {{ActorSystem}} has to be restarted.

This is a problem for the {{TaskManager}}-{{JobManager}} communication. 
Once a {{TaskManager}} gets quarantined, it is effectively useless for the 
Flink cluster because it cannot reconnect to the {{JobManager}}. In that 
case, the {{TaskManager}} has to be restarted.

The following link [1] describes how an {{ActorSystem}} can detect that it got 
quarantined.

When the {{TaskManager}} detects that it got quarantined, it should shut itself 
down with a non-zero exit code. To restart it, we could add a retry loop to the 
{{taskmanager.sh}} start script which restarts the {{TaskManager}} whenever it 
exits with a non-zero exit code.
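
A minimal sketch of such a retry loop, under stated assumptions: the function 
name {{run_with_restart}}, the restart limit and the delay parameter are 
hypothetical and not part of the existing start script, and the TaskManager is 
assumed to run in the foreground so that its exit code is visible to the script.

```shell
# Hypothetical helper, not part of Flink's taskmanager.sh.
# Runs a command in the foreground and restarts it whenever it exits with a
# non-zero code. An exit code of zero is treated as a deliberate shutdown.
run_with_restart() {
    max_restarts="$1"    # upper bound so a broken setup cannot loop forever
    delay_seconds="$2"   # pause between two start attempts
    shift 2

    restarts=0
    code=0
    while true; do
        if "$@"; then
            # Clean exit: do not restart.
            return 0
        else
            code=$?
        fi

        if [ "$restarts" -ge "$max_restarts" ]; then
            echo "giving up after $restarts restarts (last exit code $code)" >&2
            return "$code"
        fi

        restarts=$((restarts + 1))
        echo "exit code $code, restart $restarts of $max_restarts" >&2
        sleep "$delay_seconds"
    done
}
```

The script would call the helper with the command that launches the 
TaskManager process, for example {{run_with_restart 5 10 <TaskManager start 
command>}}. Bounding the number of restarts is a design choice of this sketch; 
an unbounded loop would also match the proposal above.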

[1] 
http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state
