Re: YARN JobManager HA using wrong network interface

2016-03-03 Thread Till Rohrmann
I've created an issue [1] and opened a PR [2] to fix it. [1] https://issues.apache.org/jira/browse/FLINK-3570 [2] https://github.com/apache/flink/pull/1758 Cheers, Till On Thu, Mar 3, 2016 at 12:33 PM, Maximilian Bode <maximilian.b...@tngtech.com> wrote: > Hi Ufuk, Till and Stephan, >

Re: YARN JobManager HA using wrong network interface

2016-03-03 Thread Maximilian Bode
Hi Ufuk, Till and Stephan, Yes, that is what we observed. The primary hostname, i.e. the one returned by the unix hostname command, is in fact bound to the eth0 interface, whereas Flink uses the eth1 interface (pertaining to another hostname). Changing akka.lookup.timeout to 100 s seems to fix

Re: YARN JobManager HA using wrong network interface

2016-03-03 Thread Till Rohrmann
No, I don't think this behaviour was introduced by HA. That is the default behaviour we have used for a long time. If you think we should still change it, then I can open an issue for it. On Thu, Mar 3, 2016 at 12:20 PM, Stephan Ewen wrote: > Okay, that is a change from the original behavior, int

Re: YARN JobManager HA using wrong network interface

2016-03-03 Thread Stephan Ewen
Okay, that is a change from the original behavior, introduced in HA. Originally, if the connection attempts failed, it always returned the InetAddress.getLocalHost() interface. I think we should change it back to that, because that interface is by far the best possible heuristic. On Thu, Mar 3, 20
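The fallback described in this message (try to connect, and if every attempt fails, return the InetAddress.getLocalHost() address) can be sketched roughly as below. This is only an illustration under stated assumptions, not Flink's actual code: the class name AddressHeuristic and the methods choose and canConnect are hypothetical, and the candidate list, target address and timeout are supplied by the caller.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not Flink's implementation.
final class AddressHeuristic {

    /**
     * Returns the first candidate address that passes the reachability
     * check. If none passes, returns the fallback, which in the behaviour
     * described above would be InetAddress.getLocalHost().
     */
    static InetAddress choose(List<InetAddress> candidates,
                              Predicate<InetAddress> canReachJobManager,
                              InetAddress fallback) {
        for (InetAddress candidate : candidates) {
            if (canReachJobManager.test(candidate)) {
                return candidate;
            }
        }
        return fallback;
    }

    /**
     * Tries to open a TCP connection to the target from the given local
     * address, giving up after timeoutMillis. Any I/O failure counts as
     * "not reachable".
     */
    static boolean canConnect(InetAddress local,
                              InetSocketAddress target,
                              int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.bind(new InetSocketAddress(local, 0)); // port 0: any free port
            socket.connect(target, timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

A caller would pass something like `a -> AddressHeuristic.canConnect(a, jobManagerAddress, 1000)` as the predicate, where `jobManagerAddress` is whatever address is currently believed to be the JobManager's. If that address is stale, as in the failover case discussed in this thread, every check fails and the fallback decides the result.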

Re: YARN JobManager HA using wrong network interface

2016-03-03 Thread Till Rohrmann
If I’m not mistaken, then it’s not necessarily true that the heuristic returns InetAddress.getLocalHost() in all cases. The heuristic will select the first network interface with the aforementioned conditions, but before returning it, it will try one last time to connect to the JM via the interface b

Re: YARN JobManager HA using wrong network interface

2016-03-03 Thread Stephan Ewen
If the TaskManager cannot connect to the JobManager, it will use the interface that is bound to the machine's host name ("InetAddress.getLocalHost()"). So, the best way to fix this would be to make sure that all machines have a proper network configuration. Then Flink would either use an address t
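To check which address a machine's host name is bound to, and how that compares to the addresses on its network interfaces (eth0 vs. eth1 in this thread), a small standalone probe like the following may help. It is a sketch using only the standard java.net API; the class name LocalAddressProbe and its methods are made up for illustration, and the loopback fallback is an assumption added so the probe does not fail on hosts whose name cannot be resolved.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic helper, not part of Flink.
final class LocalAddressProbe {

    /** Address the JVM resolves for the local host name; loopback if unresolvable. */
    static InetAddress hostNameAddress() {
        try {
            return InetAddress.getLocalHost();
        } catch (UnknownHostException e) {
            return InetAddress.getLoopbackAddress();
        }
    }

    /** One "interface -> address" entry per address bound to each interface. */
    static List<String> interfaceAddresses() {
        List<String> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return result; // no interfaces reported
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    result.add(nic.getName() + " -> " + addr.getHostAddress());
                }
            }
        } catch (SocketException e) {
            // Interfaces could not be enumerated; return what was collected.
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("host name address: " + hostNameAddress());
        interfaceAddresses().forEach(System.out::println);
    }
}
```

Running this on each node shows whether the host name address belongs to the interface the other machines can actually reach.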

Re: YARN JobManager HA using wrong network interface

2016-03-03 Thread Till Rohrmann
Hi Max, the problem is that before starting the TM, we have to find the network interface which is reachable by the other machines. So what we do is to connect to the current JobManager. If it should happen, as in your case, that the JobManager just died and the new JM address has not been written

Re: YARN JobManager HA using wrong network interface

2016-03-03 Thread Ufuk Celebi
I had an offline chat with Till about this. He pointed out that the address is chosen once at startup (while the old job manager cannot be reached) and then stays fixed at eth1. You can increase the lookup timeout by setting akka.lookup.timeout to a higher value (like 100 s). T

Re: YARN JobManager HA using wrong network interface

2016-03-03 Thread Ufuk Celebi
Hey Max! Regarding the first WARN in org.apache.flink.runtime.webmonitor.JobManagerRetriever: this is expected if the new leader has not updated ZooKeeper yet. The important thing is that the new leading job manager is eventually retrieved. This did happen, right? Regarding eth1 vs. eth0: After the new

YARN JobManager HA using wrong network interface

2016-03-03 Thread Maximilian Bode
Hi everyone, we are trying to get JobManager HA to work in the context of a per-job YARN session using the 1.0.0-rc3 from a few days ago and are having a problem concerning task managers with several network interfaces. After manually killing the job manager process, the jobmanager.log on the n