Hi Joey,

If the other components (e.g., Dispatcher, ResourceManager) are able to finish the leader election in a timely manner, I currently do not see a reason why it should take the REST server 20 - 45 minutes.

You can check the contents of znode /flink/.../leader/rest_server_lock to see if there is indeed no leader, or if the leader information cannot be retrieved from ZooKeeper. If you can reproduce this in a staging environment with some test jobs, I'd like to see the ClusterEntrypoint/JobManager logs (perhaps on debug level).
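In case it helps, here is a rough sketch of how to inspect that znode with the stock ZooKeeper CLI (treat <cluster-id> as a placeholder for the elided middle part of the path, and <zookeeper-host> for one of your ZooKeeper servers):

    bin/zkCli.sh -server <zookeeper-host>:2181
    ls /flink
    get /flink/<cluster-id>/leader/rest_server_lock

For the debug-level logs, assuming you are using the log4j.properties shipped with the Flink distribution, you could raise the level for the Flink packages, e.g.:

    # conf/log4j.properties
    log4j.logger.org.apache.flink=DEBUG
    # or, more narrowly, the leader election/retrieval packages
    log4j.logger.org.apache.flink.runtime.leaderelection=DEBUG
    log4j.logger.org.apache.flink.runtime.leaderretrieval=DEBUG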
Best,
Gary

On Mon, Jul 30, 2018 at 8:10 PM, Joey Echeverria <jechever...@splunk.com> wrote:
> I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a single
> Job Manager running. I’m using Zookeeper to store the fencing/leader
> information and S3 to store the job manager state. We’ve been running
> around 250 or so streaming jobs and we’ve noticed that if the job manager
> pod is deleted, it takes something like 20-45 minutes for the job manager’s
> REST endpoints and web UI to become available. Until it becomes available,
> we get a 503 response from the HTTP server with the message “Could not
> retrieve the redirect address of the current leader. Please try to
> refresh.”
>
> Has anyone else run into this?
>
> Are there any configuration settings I should be looking at to speed up
> the availability of the HTTP endpoints?
>
> Thanks!
>
> -Joey