Delay in REST/UI readiness during JM recovery

Joey Echeverria Mon, 30 Jul 2018 11:10:30 -0700

I’m running Flink 1.5.0 in Kubernetes with HA enabled, but with only a single 
Job Manager running. I’m using ZooKeeper to store the fencing/leader information 
and S3 to store the job manager state. We’ve been running around 250 
streaming jobs, and we’ve noticed that if the job manager pod is deleted, it 
takes roughly 20-45 minutes for the job manager’s REST endpoints and web 
UI to become available again. Until then, we get a 503 response from 
the HTTP server with the message "Could not retrieve the redirect address of 
the current leader. Please try to refresh."


Has anyone else run into this?

Are there any configuration settings I should be looking at to speed up the 
availability of the HTTP endpoints?

Thanks!

-Joey
