Till Rohrmann created FLINK-4711: ------------------------------------ Summary: TaskManager can crash due to failing onPartitionStateUpdate call Key: FLINK-4711 URL: https://issues.apache.org/jira/browse/FLINK-4711 Project: Flink Issue Type: Bug Components: Distributed Coordination Affects Versions: 1.2.0 Reporter: Till Rohrmann Assignee: Till Rohrmann Fix For: 1.2.0
The {{TaskManager}} can crash because it calls {{Task.onPartitionStateUpdate}} when it receives a {{PartitionState}} message. The {{onPartitionStateUpdate}} method can throw an {{IOException}} or {{InterruptedException}} which are not handled on the {{TaskManager}} level. Another problem is that the initial partition state request is triggered within the {{SingleInputGate}}. The request causes the {{JobManager}} to send a {{PartitionState}} message to the {{TaskManager}} which forwards it to the {{Task}}. If the at any of these points a message gets lost, then it is not retried and the partition state remains unknown. In order to handle the exceptions, to make the data flow clearer and to add automatic retries, I propose to let the {{Task}} send the partition state check requests. Furthermore, the {{JobManager}} should directly answer to the {{Task}} by replying to an ask operation. That way the message does not have to be routed through the {{TaskManager}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)