Robert Metzger created FLINK-2967:
-------------------------------------

             Summary: TM address detection might not always detect the right 
interface on slow networks / overloaded JMs
                 Key: FLINK-2967
                 URL: https://issues.apache.org/jira/browse/FLINK-2967
             Project: Flink
          Issue Type: Bug
    Affects Versions: 0.9, 0.10, 1.0
            Reporter: Robert Metzger
            Assignee: Robert Metzger


I'm talking to a user who is facing the following issue:
Some of the TaskManagers select the wrong IP address out of the available 
network interfaces.

The first address we try to connect to is the one returned by 
{{InetAddress.getLocalHost()}}. This address is the right IP address to use, 
but the JobManager is not able to respond within the timeout (50ms) to that 
connection request.
So the TM tries the next address, which is not publicly reachable. However, the 
TM can connect to the JM from that address, so it is selected. Later, Netty 
connections from the other TMs to this TM fail.

There are two solutions for this issue:
- Allow users to configure a higher timeout for the first address detection 
strategy. In most cases, the address returned by {{InetAddress.getLocalHost()}} 
is correct. By setting a high timeout, users with slow networks / overloaded 
JMs can make sure the TM picks this address.
- Add an Akka message which we send from the TM to the JM, and the JM tries to 
connect to the TM. If that succeeds, we know that the TM is reachable from the 
outside.
The problem with the second approach is that we have to start a separate actor 
system on the TaskManager first. We have to do this because we might use a wrong 
IP address for the TM (so we might end up starting actor systems until we find 
an externally reachable IP).
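
A minimal sketch of the first approach, to make the failure mode concrete. This 
is not Flink's actual detection code: the class {{AddressDetectionSketch}}, the 
{{Probe}} interface, and the methods {{selectAddress}}, {{socketProbe}} and {{ip}} 
are hypothetical names chosen for illustration. It assumes the candidates are 
ordered with the {{InetAddress.getLocalHost()}} result first, and that only the 
first candidate gets the configurable (higher) timeout.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;
import java.util.List;

public class AddressDetectionSketch {

    /** Hypothetical probe: can the target be reached from this local address in time? */
    public interface Probe {
        boolean canConnect(InetAddress local, int timeoutMillis);
    }

    /** Socket-based probe: bind to the local address, then connect with a timeout. */
    public static Probe socketProbe(InetSocketAddress target) {
        return (local, timeoutMillis) -> {
            try (Socket socket = new Socket()) {
                socket.bind(new InetSocketAddress(local, 0)); // port 0 = any free port
                socket.connect(target, timeoutMillis);
                return true;
            } catch (IOException e) {
                return false; // timed out or not reachable from this interface
            }
        };
    }

    /**
     * Returns the first candidate the probe accepts, or null if none is accepted.
     * Only the first candidate is tried with the (configurable) first timeout;
     * all later candidates use the fallback timeout.
     */
    public static InetAddress selectAddress(
            List<InetAddress> candidates,
            Probe probe,
            int firstTimeoutMillis,
            int fallbackTimeoutMillis) {
        for (int i = 0; i < candidates.size(); i++) {
            int timeout = (i == 0) ? firstTimeoutMillis : fallbackTimeoutMillis;
            if (probe.canConnect(candidates.get(i), timeout)) {
                return candidates.get(i);
            }
        }
        return null;
    }

    /** Builds an IPv4 address from literal octets without a DNS lookup. */
    public static InetAddress ip(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(
                    new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With a 50 ms first timeout and a JobManager that needs longer to accept the 
connection, {{selectAddress}} falls through to the second candidate, which 
matches the behaviour described above. Raising only the first timeout lets the 
preferred address win without slowing down the remaining attempts.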

I'm first going to implement the first approach. If that solution works well 
for my user, I'll contribute this to 0.10 / 1.0.
If not, I'll implement the second approach.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
