Hi Joey,

Good question! I will copy it to Till and Chesnay who know this part of the implementation.
Thanks, vino.

2018-08-03 11:09 GMT+08:00 Joey Echeverria <jechever...@splunk.com>:

> I don’t have logs available yet, but I do have some information from ZK.
>
> The culprit appears to be the /flink/default/leader/dispatcher_lock znode.
>
> I took a look at the dispatcher code here:
> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L762-L785
>
> And it looks to me that when leadership is granted it will perform job
> recovery on all jobs before it writes the new leader information to
> the /flink/default/leader/dispatcher_lock znode.
>
> So this leaves me with three questions:
>
> 1) Why does the web monitor specifically have to wait for the dispatcher?
> 2) Is there a reason why the dispatcher can’t write the lock until after
> job recovery?
> 3) Is there anything I can/should be doing to speed up job recovery?
>
> Thanks!
>
> -Joey
>
>
> On Aug 2, 2018, at 9:24 AM, Joey Echeverria <jechever...@splunk.com> wrote:
>
> Thanks for the tips Gary and Vino. I’ll try to reproduce it with test data
> and see if I can post some logs.
>
> I’ll also watch the leader znode to see if the election isn’t happening or
> if it’s not being retrieved.
>
> Thanks!
>
> -Joey
>
> On Aug 1, 2018, at 11:19 PM, Gary Yao <g...@data-artisans.com> wrote:
>
> Hi Joey,
>
> If the other components (e.g., Dispatcher, ResourceManager) are able to
> finish the leader election in a timely manner, I currently do not see a
> reason why it should take the REST server 20 - 45 minutes.
>
> You can check the contents of znode /flink/.../leader/rest_server_lock to
> see if there is indeed no leader, or if the leader information cannot be
> retrieved from ZooKeeper.
>
> If you can reproduce this in a staging environment with some test jobs, I'd
> like to see the ClusterEntrypoint/JobManager logs (perhaps on debug level).
>
> Best,
> Gary
>
> On Mon, Jul 30, 2018 at 8:10 PM, Joey Echeverria <jechever...@splunk.com>
> wrote:
>
>> I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a single
>> Job Manager running. I’m using ZooKeeper to store the fencing/leader
>> information and S3 to store the job manager state. We’ve been running
>> around 250 or so streaming jobs and we’ve noticed that if the job manager
>> pod is deleted, it takes something like 20-45 minutes for the job manager’s
>> REST endpoints and web UI to become available. Until it becomes available,
>> we get a 503 response from the HTTP server with the message "Could not
>> retrieve the redirect address of the current leader. Please try to
>> refresh.”
>>
>> Has anyone else run into this?
>>
>> Are there any configuration settings I should be looking at to speed up
>> the availability of the HTTP endpoints?
>>
>> Thanks!
>>
>> -Joey
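For anyone who wants to look at those leader znodes directly, here is a minimal
sketch using the plain ZooKeeper Java client. It assumes the default
high-availability.zookeeper.path.root (/flink) and the default cluster-id, and
"zk-host:2181" is a placeholder for the actual quorum. It only checks whether
the znodes exist and hold a payload; it does not decode the serialized leader
information Flink stores there.

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch: check whether Flink's HA leader znodes exist and carry a payload.
// Paths assume the default ZooKeeper root (/flink) and cluster-id (default);
// adjust them to match your high-availability.zookeeper.* settings.
public class LeaderZnodeCheck {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> {});
        try {
            String[] paths = {
                "/flink/default/leader/rest_server_lock",
                "/flink/default/leader/dispatcher_lock"
            };
            for (String path : paths) {
                Stat stat = zk.exists(path, false);
                if (stat == null) {
                    System.out.println(path + ": znode does not exist (no leader published yet)");
                } else {
                    byte[] data = zk.getData(path, false, stat);
                    int len = (data == null) ? 0 : data.length;
                    System.out.println(path + ": " + len + " bytes of leader information");
                }
            }
        } finally {
            zk.close();
        }
    }
}

The same check can be done interactively with ZooKeeper's zkCli.sh, e.g.
"get /flink/default/leader/rest_server_lock", which is usually quicker when
debugging a live cluster.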