[jira] [Commented] (FLINK-4711) TaskManager can crash due to failing onPartitionStateUpdate call

ASF GitHub Bot (JIRA) Thu, 29 Sep 2016 07:43:32 -0700

    [ https://issues.apache.org/jira/browse/FLINK-4711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15532964#comment-15532964 ]


ASF GitHub Bot commented on FLINK-4711:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2569

    [FLINK-4711] Let the Task trigger partition state requests and handle their responses

    This PR changes the partition state check so that the Task, instead of the
    SingleInputGate, is responsible for triggering it. Furthermore, the operation
    returns a future containing the JobManager's answer. That way we don't have to
    route the response through the TaskManager and can add automatic retries in
    case of a timeout.

    The PR removes the JobManagerCommunicationFactory and gets rid of the excessive
    creation of PartitionStateChecker and ResultPartitionConsumableNotifier
    instances. Instead of creating one PartitionStateChecker for each
    SingleInputGate, we create one for the TaskManager, which is reused across all
    SingleInputGates. The same applies to the ResultPartitionConsumableNotifier.
    
    This PR is also a simplification for the Flip-6 implementation.
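
To illustrate the future-based request with automatic retries described above, here is a minimal sketch in plain Java. It is not the code from the pull request: the class `PartitionStateRetrySketch`, the method `requestWithRetry`, the `PartitionState` enum values and the simulated lossy request are all hypothetical names chosen for illustration, and only `java.util.concurrent` classes are used.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names are taken from the Flink code base.
class PartitionStateRetrySketch {

    // Stand-in for the JobManager's answer.
    enum PartitionState { RUNNING, FINISHED }

    // Sends the request and returns a future with the answer. If no answer
    // arrives within timeoutMillis, the request is sent again, at most
    // retriesLeft more times; after that the future fails with a TimeoutException.
    static <T> CompletableFuture<T> requestWithRetry(
            Supplier<CompletableFuture<T>> request,
            int retriesLeft,
            long timeoutMillis,
            ScheduledExecutorService timer) {

        CompletableFuture<T> result = new CompletableFuture<>();

        // Forward the outcome of this attempt; a late answer is still accepted.
        request.get().whenComplete((value, error) -> complete(result, value, error));

        // Timeout handling: retry or give up if nothing has arrived yet.
        timer.schedule(() -> {
            if (result.isDone()) {
                return;
            }
            if (retriesLeft > 0) {
                requestWithRetry(request, retriesLeft - 1, timeoutMillis, timer)
                        .whenComplete((value, error) -> complete(result, value, error));
            } else {
                result.completeExceptionally(
                        new TimeoutException("no answer to partition state request"));
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);

        return result;
    }

    private static <T> void complete(CompletableFuture<T> target, T value, Throwable error) {
        if (error != null) {
            target.completeExceptionally(error);
        } else {
            target.complete(value);
        }
    }

    // Simulates a JobManager whose first two answers get lost.
    static PartitionState demo() {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger attempts = new AtomicInteger();
        try {
            Supplier<CompletableFuture<PartitionState>> lossyRequest = () ->
                    attempts.incrementAndGet() < 3
                            ? new CompletableFuture<PartitionState>() // never answered
                            : CompletableFuture.completedFuture(PartitionState.FINISHED);
            return requestWithRetry(lossyRequest, 3, 50, timer).join();
        } finally {
            timer.shutdownNow();
        }
    }
}
```

Because the caller holds the future itself, a lost message shows up as a timeout on that future and can be retried locally, without any intermediate component having to forward the answer.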

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixOnUpdatePartitionState

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2569.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2569
    
----
commit eefd4ee31633656d134078503a60f43e14806311
Author: Till Rohrmann <trohrm...@apache.org>
Date:   2016-09-29T14:19:30Z

    [FLINK-4711] Let the Task trigger partition state requests and handle their responses

    This PR changes the partition state check so that the Task, instead of the
    SingleInputGate, is responsible for triggering it. Furthermore, the operation
    returns a future containing the JobManager's answer. That way we don't have to
    route the response through the TaskManager and can add automatic retries in
    case of a timeout.

    The PR removes the JobManagerCommunicationFactory and gets rid of the excessive
    creation of PartitionStateChecker and ResultPartitionConsumableNotifier
    instances. Instead of creating one PartitionStateChecker for each
    SingleInputGate, we create one for the TaskManager, which is reused across all
    SingleInputGates. The same applies to the ResultPartitionConsumableNotifier.

----


> TaskManager can crash due to failing onPartitionStateUpdate call
> ----------------------------------------------------------------
>
>                 Key: FLINK-4711
>                 URL: https://issues.apache.org/jira/browse/FLINK-4711
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.2.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>             Fix For: 1.2.0
>
>
> The {{TaskManager}} can crash because it calls
> {{Task.onPartitionStateUpdate}} when it receives a {{PartitionState}}
> message. The {{onPartitionStateUpdate}} method can throw an {{IOException}}
> or {{InterruptedException}}, neither of which is handled at the
> {{TaskManager}} level.
> Another problem is that the initial partition state request is triggered
> within the {{SingleInputGate}}. The request causes the {{JobManager}} to send
> a {{PartitionState}} message to the {{TaskManager}}, which forwards it to the
> {{Task}}. If a message gets lost at any of these points, the request is not
> retried and the partition state remains unknown.
> In order to handle the exceptions, to make the data flow clearer and to add
> automatic retries, I propose to let the {{Task}} send the partition state
> check requests. Furthermore, the {{JobManager}} should answer the {{Task}}
> directly by replying to an ask operation. That way the message does not have
> to be routed through the {{TaskManager}}.
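
The crash described in the issue comes from checked exceptions escaping a callback. One possible way to contain them is sketched below in plain Java; it is an assumption for illustration, not the fix from the pull request. The names `PartitionStateUpdateSketch`, `PartitionStateListener` and `deliver` are hypothetical, and the partition id and state are plain strings to keep the example self-contained.

```java
import java.io.IOException;
import java.util.Optional;

// Hypothetical sketch: the interface and method names are illustrative only.
class PartitionStateUpdateSketch {

    // Stand-in for a callback that may throw checked exceptions.
    interface PartitionStateListener {
        void onPartitionStateUpdate(String partitionId, String state)
                throws IOException, InterruptedException;
    }

    // Delivers the update and returns the failure cause, if any, instead of
    // letting the exception propagate to the caller.
    static Optional<Throwable> deliver(
            PartitionStateListener listener, String partitionId, String state) {
        try {
            listener.onPartitionStateUpdate(partitionId, state);
            return Optional.empty();
        } catch (IOException e) {
            return Optional.of(e);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for code further up the stack.
            Thread.currentThread().interrupt();
            return Optional.of(e);
        }
    }

    // Shows that a failing callback does not crash the caller.
    static boolean demo() {
        Optional<Throwable> ok = deliver((id, state) -> { }, "p-1", "FINISHED");
        Optional<Throwable> failed = deliver(
                (id, state) -> { throw new IOException("channel closed"); },
                "p-2", "FINISHED");
        return !ok.isPresent() && failed.get() instanceof IOException;
    }
}
```

In this sketch the caller receives the failure as a value and can decide to fail only the affected task, so the component that delivers the message keeps running.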



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
