I've created an issue [1] and opened a PR [2] to fix it.
[1] https://issues.apache.org/jira/browse/FLINK-3570
[2] https://github.com/apache/flink/pull/1758
Cheers,
Till
On Thu, Mar 3, 2016 at 12:33 PM, Maximilian Bode <
maximilian.b...@tngtech.com> wrote:
Hi Ufuk, Till and Stephan,
Yes, that is what we observed. The primary hostname, i.e. the one returned by
the unix hostname command, is in fact bound to the eth0 interface, whereas
Flink uses the eth1 interface (pertaining to another hostname).
Changing akka.lookup.timeout to 100 s seems to fix
No, I don't think this behaviour has been introduced by HA. That is the
default behaviour we used for a long time. If you think we should still
change it, then I can open an issue for it.
On Thu, Mar 3, 2016 at 12:20 PM, Stephan Ewen wrote:
Okay, that is a change from the original behavior, introduced in HA.
Originally, if the connection attempts failed, it always returned the
InetAddress.getLocalHost() interface.
I think we should change it back to that, because that interface is by far
the best possible heuristic.
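
To make the fallback discussed here concrete, below is a minimal sketch of that kind of heuristic: probe each local address by trying to connect to the JobManager from it, and fall back to InetAddress.getLocalHost() if none of the attempts succeeds. This is not Flink's actual code. The class name LocalAddressHeuristic, the method names, and the 1000 ms probe timeout are all made up for illustration, and the real heuristic applies further conditions that are not reproduced here.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.Socket;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; not taken from the Flink code base.
final class LocalAddressHeuristic {

    /** Returns the first candidate that passes the probe, otherwise the fallback. */
    static InetAddress choose(List<InetAddress> candidates,
                              Predicate<InetAddress> canReachJobManager,
                              InetAddress fallback) {
        for (InetAddress candidate : candidates) {
            if (canReachJobManager.test(candidate)) {
                return candidate;
            }
        }
        return fallback;
    }

    /** Collects the addresses of all network interfaces that are currently up. */
    static List<InetAddress> localAddresses() throws SocketException {
        List<InetAddress> result = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return result; // no interfaces reported on this machine
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            if (nic.isUp()) {
                result.addAll(Collections.list(nic.getInetAddresses()));
            }
        }
        return result;
    }

    /** Probe: bind a socket to the local address and try a TCP connect to the target. */
    static boolean canConnect(InetAddress local, InetSocketAddress target, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.bind(new InetSocketAddress(local, 0)); // port 0 = any free local port
            socket.connect(target, timeoutMillis);
            return true;
        } catch (IOException e) {
            return false; // unreachable from this address, or the bind failed
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: LocalAddressHeuristic <jobmanager-host> <port>");
            return;
        }
        InetSocketAddress jobManager =
                new InetSocketAddress(args[0], Integer.parseInt(args[1]));
        InetAddress chosen = choose(
                localAddresses(),
                local -> canConnect(local, jobManager, 1000), // 1000 ms is an arbitrary choice
                InetAddress.getLocalHost());
        System.out.println(chosen);
    }
}
```

The selection step is kept separate from the socket probe so that the fallback can be exercised without a network. If every probe fails, for example because the JobManager has just died, the result is whatever InetAddress.getLocalHost() resolves to.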
On Thu, Mar 3, 20
If I’m not mistaken, then it’s not necessarily true that the heuristic
returns InetAddress.getLocalHost() in all cases. The heuristic will select
the first network interface with the aforementioned conditions but before
returning it, it will try a last time to connect to the JM via the
interface b
If the TaskManager cannot connect to the JobManager, it will use the
interface that is bound to the machine's host name
("InetAddress.getLocalHost()").
So, the best way to fix this would be to make sure that all machines have a
proper network configuration. Then Flink would either use an address t
Hi Max,
the problem is that before starting the TM, we have to find the network
interface which is reachable by the other machines. So what we do is to
connect to the current JobManager. If it should happen, as in your case,
that the JobManager just died and the new JM address has not been written
I had an offline chat with Till about this. He pointed out that the
address is chosen once at start up time (while not being able to
connect to the old job manager) and then it stays fixed at eth1.
You can increase the lookup timeout by setting akka.lookup.timeout to
a higher value (like 100 s). T
Hey Max!
for the first WARN in
org.apache.flink.runtime.webmonitor.JobManagerRetriever: this is
expected if the new leader has not updated ZooKeeper yet. The
important thing is that the new leading job manager is eventually
retrieved. This did happen, right?
Regarding eth1 vs. eth0: After the new
Hi everyone,
we are trying to get JobManager HA to work in the context of a per-job YARN
session using the 1.0.0-rc3 from a few days ago and are having a problem
concerning task managers with several network interfaces.
After manually killing the job manager process, the jobmanager.log on the n