Till Rohrmann created FLINK-4711:
------------------------------------

             Summary: TaskManager can crash due to failing 
onPartitionStateUpdate call
                 Key: FLINK-4711
                 URL: https://issues.apache.org/jira/browse/FLINK-4711
             Project: Flink
          Issue Type: Bug
          Components: Distributed Coordination
    Affects Versions: 1.2.0
            Reporter: Till Rohrmann
            Assignee: Till Rohrmann
             Fix For: 1.2.0


The {{TaskManager}} can crash because it calls {{Task.onPartitionStateUpdate}} 
when it receives a {{PartitionState}} message. The {{onPartitionStateUpdate}} 
method can throw an {{IOException}} or {{InterruptedException}} which are not 
handled on the {{TaskManager}} level.

Another problem is that the initial partition state request is triggered within 
the {{SingleInputGate}}. The request causes the {{JobManager}} to send a 
{{PartitionState}} message to the {{TaskManager}} which forwards it to the 
{{Task}}. If the at any of these points a message gets lost, then it is not 
retried and the partition state remains unknown.

In order to handle the exceptions, to make the data flow clearer and to add 
automatic retries, I propose to let the {{Task}} send the partition state check 
requests. Furthermore, the {{JobManager}} should directly answer to the 
{{Task}} by replying to an ask operation. That way the message does not have to 
be routed through the {{TaskManager}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to