Thanks for the tips, Gary and Vino. I'll try to reproduce it with test data and
see if I can post some logs.

I'll also watch the leader znode to see whether the election isn't happening or
whether the leader information isn't being retrieved.

Thanks!

-Joey

On Aug 1, 2018, at 11:19 PM, Gary Yao <g...@data-artisans.com> wrote:

Hi Joey,

If the other components (e.g., Dispatcher, ResourceManager) are able to finish
the leader election in a timely manner, I currently do not see a reason why it
should take the REST server 20 - 45 minutes.

You can check the contents of znode /flink/.../leader/rest_server_lock to see
if there is indeed no leader, or if the leader information cannot be retrieved
from ZooKeeper.
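
With the ZooKeeper CLI (bin/zkCli.sh) a plain "get" on that path is enough. If
you prefer to script the check, here is a minimal sketch using the plain
ZooKeeper Java client; the quorum address and the cluster-id segment of the
path are placeholders you would have to adjust:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class RestServerLeaderCheck {
        public static void main(String[] args) throws Exception {
            // Placeholders: adjust the quorum address and the cluster-id.
            String quorum = "localhost:2181";
            String path = "/flink/<cluster-id>/leader/rest_server_lock";

            // Wait until the client session is established.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(quorum, 30000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await(30, TimeUnit.SECONDS);

            // exists() tells us whether leader information has been published at all.
            Stat stat = zk.exists(path, false);
            if (stat == null) {
                System.out.println("No rest_server_lock znode -> no leader published yet");
            } else {
                // The payload is the serialized leader information.
                byte[] data = zk.getData(path, false, stat);
                System.out.println("Leader znode present, payload size: " + data.length + " bytes");
            }
            zk.close();
        }
    }

If the znode is missing or empty, that points at the election itself; if it is
populated but the REST server still returns 503, that points at retrieving the
leader information.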

If you can reproduce this in a staging environment with some test jobs, I'd
like to see the ClusterEntrypoint/JobManager logs (perhaps on debug level).
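
For the debug level, assuming you are using the log4j.properties file that
ships in Flink's conf/ directory, a sketch of the change could look like this
(the logger name is just a suggestion; adapt it to your own logging setup):

    # conf/log4j.properties (sketch)
    log4j.rootLogger=INFO, file
    # Raise Flink's runtime components (ClusterEntrypoint, JobManager, REST handlers) to DEBUG.
    log4j.logger.org.apache.flink.runtime=DEBUG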

Best,
Gary

On Mon, Jul 30, 2018 at 8:10 PM, Joey Echeverria <jechever...@splunk.com> wrote:
I'm running Flink 1.5.0 in Kubernetes with HA enabled, but with only a single
Job Manager. I'm using ZooKeeper to store the fencing/leader information and S3
to store the job manager state. We've been running around 250 streaming jobs,
and we've noticed that if the job manager pod is deleted, it takes something
like 20-45 minutes for the job manager's REST endpoints and web UI to become
available. Until then, we get a 503 response from the HTTP server with the
message "Could not retrieve the redirect address of the current leader. Please
try to refresh."
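
For context, the HA-related part of a setup like this in flink-conf.yaml looks
roughly as follows (hostnames, cluster id, and bucket below are placeholders,
not the actual values):

    # flink-conf.yaml (sketch; placeholder values)
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-0:2181,zk-1:2181,zk-2:2181
    high-availability.zookeeper.path.root: /flink
    high-availability.cluster-id: /my-cluster
    high-availability.storageDir: s3://my-bucket/flink/ha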

Has anyone else run into this?

Are there any configuration settings I should be looking at to speed up the 
availability of the HTTP endpoints?

Thanks!

-Joey

