[ https://issues.apache.org/jira/browse/FLINK-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007263#comment-15007263 ]
ASF GitHub Bot commented on FLINK-2967:
---------------------------------------

GitHub user rmetzger opened a pull request:

    https://github.com/apache/flink/pull/1361

    [FLINK-2967] Enhance TaskManager network detection

    JIRA: https://issues.apache.org/jira/browse/FLINK-2967

    - Increase timeout for `LOCAL_HOST` address detection strategy
    - give the local host address a higher priority

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rmetzger/flink flink2967-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1361.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1361

----
commit 859a19fdf7c6360765cba8706d356f0d00959128
Author: Robert Metzger <rmetz...@apache.org>
Date:   2015-11-16T20:26:57Z

    [FLINK-2967] Increase timeout for LOCAL_HOST address detection strategy, give the local host address a higher priority

----

> TM address detection might not always detect the right interface on slow
> networks / overloaded JMs
> ------------------------------------------------------------------------
>
>                 Key: FLINK-2967
>                 URL: https://issues.apache.org/jira/browse/FLINK-2967
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 0.9, 0.10.0, 1.0.0
>            Reporter: Robert Metzger
>            Assignee: Robert Metzger
>
> I'm talking to a user who is facing the following issue:
> Some of the TaskManagers select the wrong IP address out of the available
> network interfaces.
> The first address we try to connect from is the one returned by
> {{InetAddress.getLocalHost()}}. This address is the right IP address to use,
> but the JobManager is not able to respond to that connection request within
> the timeout (50ms).
> So the TM tries the next address, which is not publicly reachable. However,
> the TM can connect to the JM from there. Netty will later fail to connect to
> the TM from the other TMs.
> There are two solutions for this issue:
> - Allow users to configure a higher timeout for the first address detection
> strategy. In most cases, the address returned by
> {{InetAddress.getLocalHost()}} is correct. By setting a high timeout, users
> with slow networks / overloaded JMs can make sure the TM picks this address.
> - Add an Akka message which we send from the TM to the JM, after which the JM
> tries to connect to the TM. If that succeeds, we know that the TM is
> reachable from the outside.
> The problem is that we have to start a separate actor system on the
> TaskManager first. We have to do this because we might use a wrong IP address
> for the TM (so we might end up starting actor systems until we find an
> externally reachable IP).
> I'm first going to implement the first approach. If that solution works well
> for my user, I'll contribute this to 0.10 / 1.0.
> If not, I'll implement the second approach.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
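
For readers following the issue, the first approach (try the address returned by `InetAddress.getLocalHost()` first with a longer timeout, then fall back to other interface addresses) might look roughly like the sketch below. This is not Flink's actual code: the class name `AddressDetectionSketch`, the methods `canConnectFrom` and `pickAddress`, and all parameters are hypothetical and only illustrate the idea with plain `java.net` sockets.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketAddress;
import java.util.List;

// Hypothetical illustration only; names and structure are assumptions, not Flink's implementation.
class AddressDetectionSketch {

    /** Returns true if a TCP connection bound to localAddress reaches target within timeoutMillis. */
    static boolean canConnectFrom(InetAddress localAddress, SocketAddress target, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // Bind to the candidate local address; port 0 lets the OS pick a free local port.
            socket.bind(new InetSocketAddress(localAddress, 0));
            socket.connect(target, timeoutMillis);
            return true;
        } catch (IOException e) {
            // Timeout, refused connection, or unroutable address: this candidate does not work.
            return false;
        }
    }

    /**
     * Tries the preferred address (e.g. the result of InetAddress.getLocalHost()) first with its
     * own, typically longer, timeout, then each fallback address with the shorter timeout.
     * Returns the first address that can reach the JobManager, or null if none can.
     */
    static InetAddress pickAddress(InetAddress preferred, int preferredTimeoutMillis,
                                   List<InetAddress> fallbacks, int fallbackTimeoutMillis,
                                   SocketAddress jobManager) {
        if (canConnectFrom(preferred, jobManager, preferredTimeoutMillis)) {
            return preferred;
        }
        for (InetAddress candidate : fallbacks) {
            if (canConnectFrom(candidate, jobManager, fallbackTimeoutMillis)) {
                return candidate;
            }
        }
        return null;
    }
}
```

With a short timeout such as the 50ms mentioned in the issue, a slow JobManager makes the first check fail and a fallback address wins. Passing a larger `preferredTimeoutMillis` gives the preferred address more time to succeed before any fallback is considered.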