Hi Joey,

If the other components (e.g., Dispatcher, ResourceManager) are able to finish the leader election in a timely manner, I currently do not see a reason why it should take the REST server 20 - 45 minutes.

You can check the contents of znode /flink/.../leader/rest_server_lock to see if there is indeed no leader, or if the leader information cannot be retrieved from ZooKeeper. If you can reproduce this in a staging environment with some test jobs, I'd like to see the ClusterEntrypoint/JobManager logs (perhaps on debug level).
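In case it helps, here is a rough sketch of how to inspect that znode with the stock ZooKeeper CLI (treat <cluster-id> as a placeholder for the elided middle part of the path, and <zookeeper-host> for one of your ZooKeeper servers):

    bin/zkCli.sh -server <zookeeper-host>:2181
    ls /flink
    get /flink/<cluster-id>/leader/rest_server_lock

For the debug-level logs, assuming you are using the log4j.properties shipped with the Flink distribution, you could raise the level for the Flink packages, e.g.:

    # conf/log4j.properties
    log4j.logger.org.apache.flink=DEBUG
    # or, more narrowly, the leader election/retrieval packages
    log4j.logger.org.apache.flink.runtime.leaderelection=DEBUG
    log4j.logger.org.apache.flink.runtime.leaderretrieval=DEBUG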
Best,
Gary

On Mon, Jul 30, 2018 at 8:10 PM, Joey Echeverria <jechever...@splunk.com> wrote:
> I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a single
> Job Manager running. I’m using Zookeeper to store the fencing/leader
> information and S3 to store the job manager state. We’ve been running
> around 250 or so streaming jobs and we’ve noticed that if the job manager
> pod is deleted, it takes something like 20-45 minutes for the job manager’s
> REST endpoints and web UI to become available. Until it becomes available,
> we get a 503 response from the HTTP server with the message “Could not
> retrieve the redirect address of the current leader. Please try to
> refresh.”
>
> Has anyone else run into this?
>
> Are there any configuration settings I should be looking at to speed up
> the availability of the HTTP endpoints?
>
> Thanks!
>
> -Joey