Hi,
I'm having some trouble with Flink jobmanagers in an HA setup within
OpenShift.
I have three jobmanagers, a ZooKeeper cluster and a load balancer
(OpenShift/Kubernetes Route) for the web UI / REST server on the
jobmanagers. Everything works fine as long as the load balancer connects
to the leader. However, when the leader changes and the load balancer
connects to a non-leader, that jobmanager redirects to the leader using
the IP address of the host. Since routing in our network is done by
hostname, the client cannot reach the node via its IP address and the
request times out.
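For anyone who wants to check for the same behaviour, here is a minimal
Python sketch that tests whether the Location header of a redirect
points at a bare IP address. The function name and the example URLs are
made up for illustration; they are not taken from our setup or from
Flink itself.

```python
import ipaddress
from urllib.parse import urlsplit


def redirect_uses_ip(location: str) -> bool:
    """Return True if the host part of a redirect URL is an IP literal.

    `location` is the value of an HTTP Location header. Relative
    locations (no host part) are reported as False.
    """
    host = urlsplit(location).hostname
    if host is None:
        return False
    try:
        # ip_address() raises ValueError for anything that is not a
        # valid IPv4 or IPv6 literal, e.g. a DNS hostname.
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical redirect targets, for illustration only.
    print(redirect_uses_ip("http://10.1.2.3:8081/overview"))
    print(redirect_uses_ip("http://jobmanager-1.example.internal:8081/overview"))
```

Given the Location header returned by a non-leader, this should report
True in the situation described above.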
So I have a few questions:
1. Why is Flink using the IP address instead of the hostnames that are
configured in the config? At other times it does use the hostname, for
example in the info sent to ZooKeeper.
2. Is there another way of coping with connections to non-leaders
instead of redirects? Maybe proxying through a non-leader to the leader?
Cheers,
Jeroen