[ 
https://issues.apache.org/jira/browse/FLINK-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007263#comment-15007263
 ] 

ASF GitHub Bot commented on FLINK-2967:
---------------------------------------

GitHub user rmetzger opened a pull request:

    https://github.com/apache/flink/pull/1361

    [FLINK-2967] Enhance TaskManager network detection

    JIRA: https://issues.apache.org/jira/browse/FLINK-2967
    
    - Increase timeout for `LOCAL_HOST` address detection strategy
    - Give the local host address a higher priority

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rmetzger/flink flink2967-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1361.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1361
    
----
commit 859a19fdf7c6360765cba8706d356f0d00959128
Author: Robert Metzger <rmetz...@apache.org>
Date:   2015-11-16T20:26:57Z

    [FLINK-2967] Increase timeout for LOCAL_HOST address detection strategy,
    give the local host address a higher priority

----


> TM address detection might not always detect the right interface on slow 
> networks / overloaded JMs
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-2967
>                 URL: https://issues.apache.org/jira/browse/FLINK-2967
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 0.9, 0.10.0, 1.0.0
>            Reporter: Robert Metzger
>            Assignee: Robert Metzger
>
> I'm talking to a user who is facing the following issue:
> Some of the TaskManagers select the wrong IP address out of the available 
> network interfaces.
> The first address we try to connect to is the one returned by 
> {{InetAddress.getLocalHost()}}. This address is the right IP address to use, 
> but the JobManager is not able to respond within the timeout (50ms) to that 
> connection request.
> So the TM tries the next address, which is not publicly reachable. However, 
> the TM can connect to the JM from there. Netty will later fail to connect to 
> the TM from the other TMs.
> There are two solutions for this issue:
> - Allow users to configure a higher timeout for the first address detection 
> strategy. In most cases, the address returned by 
> {{InetAddress.getLocalHost()}} is correct. By setting a high timeout, users 
> with slow networks / overloaded JMs can make sure the TM picks this address.
> - Add an Akka message which we send from the TM to the JM, after which the 
> JM tries to connect to the TM. If that succeeds, we know that the TM is 
> reachable from the outside.
> The problem is that we have to start a separate actor system on the 
> TaskManager first. We have to do this because we might use a wrong IP address 
> for the TM (so we might end up starting actor systems until we find an 
> externally reachable IP).
> I'm first going to implement the first approach. If that solution works well 
> for my user, I'll contribute this to 0.10 / 1.0.
> If not, I'll implement the second approach.
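
To illustrate the first approach described above, here is a minimal Java sketch. It is not Flink's actual detection code: the class name `AddressDetectionSketch` and the methods `canConnect` and `pick` are hypothetical, and the concrete timeout values in the usage note are assumptions (only the 50ms figure comes from the issue). The idea is that the first candidate, typically the address returned by `InetAddress.getLocalHost()`, is probed with its own, larger timeout, while the remaining candidates keep the short one.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketAddress;
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;

// Hypothetical sketch, not the Flink implementation.
public class AddressDetectionSketch {

    /**
     * Returns true if a TCP connection from the given local address to the
     * target (e.g. the JobManager) can be opened within timeoutMs.
     */
    static boolean canConnect(InetAddress local, SocketAddress target, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // Bind to the candidate interface; port 0 picks an ephemeral port.
            socket.bind(new InetSocketAddress(local, 0));
            socket.connect(target, timeoutMs);
            return true;
        } catch (IOException e) {
            // Timeouts and unreachable routes both count as "not usable".
            return false;
        }
    }

    /**
     * Probes the candidates in order and returns the first one the probe
     * accepts. The first candidate is probed with preferredTimeoutMs, all
     * others with fallbackTimeoutMs.
     */
    static <T> Optional<T> pick(List<T> candidates,
                                BiPredicate<T, Integer> probe,
                                int preferredTimeoutMs,
                                int fallbackTimeoutMs) {
        for (int i = 0; i < candidates.size(); i++) {
            int timeoutMs = (i == 0) ? preferredTimeoutMs : fallbackTimeoutMs;
            T candidate = candidates.get(i);
            if (probe.test(candidate, timeoutMs)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

A caller could put `InetAddress.getLocalHost()` first in the candidate list, follow it with the addresses of the other network interfaces, and pass `(addr, t) -> canConnect(addr, jobManagerAddress, t)` as the probe, where `jobManagerAddress` is whatever socket address the JM listens on. With a preferred timeout of, say, a few seconds instead of 50ms, a slow JM still has time to answer on the correct address before the next interface is tried.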



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
