I'm running Flink 1.5.0 on Kubernetes with HA enabled, but with only a single JobManager running. I'm using ZooKeeper to store the fencing/leader information and S3 to store the JobManager state. We've been running about 250 streaming jobs, and we've noticed that if the JobManager pod is deleted, it takes 20-45 minutes for the JobManager's REST endpoints and web UI to become available again. Until then, the HTTP server returns a 503 response with the message "Could not retrieve the redirect address of the current leader. Please try to refresh."
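To put a concrete number on the delay, something like the following could be used to time how long the REST endpoint keeps returning 503 after the pod is deleted. This is only a minimal sketch, not what we actually run: the function names, the polling interval, and the one-hour cap are all made up for illustration.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=5.0):
    """Return the HTTP status code for a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses land here, e.g. 503 while no leader is known.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout while the pod restarts.
        return None


def wait_until_available(fetch_status, max_wait_s=3600.0, interval_s=10.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll fetch_status() until it returns 200.

    Returns the elapsed seconds at the first 200 response, or None if
    max_wait_s passes first. clock and sleep are injectable so the loop
    can be exercised without real waiting.
    """
    start = clock()
    while True:
        status = fetch_status()
        elapsed = clock() - start
        if status == 200:
            return elapsed
        if elapsed >= max_wait_s:
            return None
        sleep(interval_s)
```

It would be started right after deleting the pod, e.g. `wait_until_available(lambda: http_status("http://flink-jobmanager:8081/overview"))`, where the host name and port are placeholders for whatever the Kubernetes service exposes.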
Has anyone else run into this? Are there any configuration settings I should be looking at to speed up the availability of the HTTP endpoints? Thanks! -Joey