[ https://issues.apache.org/jira/browse/FLINK-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15532964#comment-15532964 ]
ASF GitHub Bot commented on FLINK-4711:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2569

    [FLINK-4711] Let the Task trigger partition state requests and handle their responses

    This PR changes the partition state check so that the Task, instead of the SingleInputGate, is responsible for triggering it. Furthermore, the operation returns a future containing the JobManager's answer. That way we don't have to route the response through the TaskManager and can add automatic retries in case of a timeout.

    The PR removes the JobManagerCommunicationFactory and gets rid of the excessive PartitionStateChecker and ResultPartitionConsumableNotifier creation. Instead of creating one PartitionStateChecker for each SingleInputGate, we create one for the TaskManager, which is reused across all SingleInputGates. The same applies to the ResultPartitionConsumableNotifier.

    This PR is also a simplification for the Flip-6 implementation.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixOnUpdatePartitionState

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2569.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2569

----
commit eefd4ee31633656d134078503a60f43e14806311
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2016-09-29T14:19:30Z

    [FLINK-4711] Let the Task trigger partition state requests and handle their responses

    This PR changes the partition state check so that the Task, instead of the SingleInputGate, is responsible for triggering it. Furthermore, the operation returns a future containing the JobManager's answer. That way we don't have to route the response through the TaskManager and can add automatic retries in case of a timeout.

    The PR removes the JobManagerCommunicationFactory and gets rid of the excessive PartitionStateChecker and ResultPartitionConsumableNotifier creation. Instead of creating one PartitionStateChecker for each SingleInputGate, we create one for the TaskManager, which is reused across all SingleInputGates. The same applies to the ResultPartitionConsumableNotifier.

----

> TaskManager can crash due to failing onPartitionStateUpdate call
> ----------------------------------------------------------------
>
>                 Key: FLINK-4711
>                 URL: https://issues.apache.org/jira/browse/FLINK-4711
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.2.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>             Fix For: 1.2.0
>
>
> The {{TaskManager}} can crash because it calls
> {{Task.onPartitionStateUpdate}} when it receives a {{PartitionState}}
> message. The {{onPartitionStateUpdate}} method can throw an {{IOException}}
> or {{InterruptedException}}, neither of which is handled at the
> {{TaskManager}} level.
> Another problem is that the initial partition state request is triggered
> within the {{SingleInputGate}}. The request causes the {{JobManager}} to send
> a {{PartitionState}} message to the {{TaskManager}}, which forwards it to the
> {{Task}}. If a message gets lost at any of these points, the request is not
> retried and the partition state remains unknown.
> In order to handle the exceptions, to make the data flow clearer and to add
> automatic retries, I propose to let the {{Task}} send the partition state
> check requests. Furthermore, the {{JobManager}} should answer the {{Task}}
> directly by replying to an ask operation. That way the message does not have
> to be routed through the {{TaskManager}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
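The proposal above (an ask that returns a future, retried automatically when it times out) can be sketched roughly as follows. This is only an illustration of the pattern using plain java.util.concurrent, not the code from the pull request and not Flink's API: the class PartitionStateRetrySketch, the methods withTimeout and askWithRetries, and the values of the PartitionState enum are all hypothetical names chosen for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class PartitionStateRetrySketch {

    /** Hypothetical stand-in for the answer the JobManager sends back. */
    enum PartitionState { RUNNING, FINISHED, FAILED }

    /** Daemon timer used only to fail requests that receive no answer in time. */
    static final ScheduledExecutorService TIMER =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "ask-timeout-timer");
            t.setDaemon(true);
            return t;
        });

    /** Mirrors the given future, but fails with a TimeoutException if it is too slow. */
    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> future, long timeoutMillis) {
        CompletableFuture<T> result = new CompletableFuture<>();
        future.whenComplete((value, error) -> {
            if (error != null) {
                result.completeExceptionally(error);
            } else {
                result.complete(value);
            }
        });
        // Has no effect if the result was already completed by the answer.
        TIMER.schedule(
            () -> result.completeExceptionally(new TimeoutException("no answer in time")),
            timeoutMillis, TimeUnit.MILLISECONDS);
        return result;
    }

    /** Sends the request; on timeout, sends it again until no retries are left. */
    static <T> CompletableFuture<T> askWithRetries(
            Supplier<CompletableFuture<T>> ask, long timeoutMillis, int retriesLeft) {
        CompletableFuture<T> result = new CompletableFuture<>();
        withTimeout(ask.get(), timeoutMillis).whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (error instanceof TimeoutException && retriesLeft > 0) {
                // The request or its answer may have been lost: ask again.
                askWithRetries(ask, timeoutMillis, retriesLeft - 1).whenComplete((v, e) -> {
                    if (e == null) {
                        result.complete(v);
                    } else {
                        result.completeExceptionally(e);
                    }
                });
            } else {
                // Non-timeout failures and exhausted retries surface to the caller.
                result.completeExceptionally(error);
            }
        });
        return result;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        // Simulated JobManager: the first two requests are "lost" and never answered.
        Supplier<CompletableFuture<PartitionState>> ask = () -> {
            if (attempts.incrementAndGet() < 3) {
                return new CompletableFuture<PartitionState>();
            }
            return CompletableFuture.completedFuture(PartitionState.FINISHED);
        };
        PartitionState state = askWithRetries(ask, 50L, 5).get(5, TimeUnit.SECONDS);
        System.out.println(state + " after " + attempts.get() + " attempts");
    }
}
```

Because the caller holds the future, any exception ends up in that future and can be handled by the caller (here, the Task) rather than propagating into a component that merely forwards messages.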