Hi Joey,

Your analysis is correct: currently, the Dispatcher first tries to recover all persisted jobs before it confirms its leadership.
1) The Dispatcher provides much of the information you see in the web UI. Without a leading Dispatcher, the web UI cannot show much. This could be changed so that, while no Dispatcher is leader, the web UI still comes up and simply cannot display certain information (number of running jobs, job details, etc.). Could you create a JIRA issue for fixing this?

2) The Dispatcher first tries to recover the persisted jobs before confirming its leadership because it wants to restore its internal state before it becomes accessible to other components and, thus, before that state can change. Otherwise the following problem could arise: assume you submit a job to the cluster. The cluster receives the JobGraph and persists it in ZooKeeper, but the Dispatcher fails before it can acknowledge the submission. The client sees the failure and re-submits the job. The Dispatcher is then restarted and starts recovering the persisted jobs. If we did not wait for this recovery to complete, the retried submission could succeed first simply because it is faster. The job recovery would then fail, because the Dispatcher is already executing this job (due to the re-submission) and the assumption is that recovered jobs are submitted first. The same applies if you submit a modified job with the same JobID as a persisted job: which job should the system execute then, the old one or the newly submitted one? By completing the recovery first, we give precedence to the persisted jobs. One could also solve this slightly differently by only blocking job submissions while a recovery is happening (see the sketch below); however, one would have to check that no other RPCs change the internal state in a way that interferes with the job recovery. Could you maybe open a JIRA issue for solving this problem as well?

3) The job recovery is mainly limited by the connection to your persistent storage system (HDFS or S3, I assume) where the JobGraphs are stored. Alternatively, you could split the jobs across multiple Flink clusters in order to decrease the number of jobs which need to be recovered in case of a failure.

Thanks a lot for reporting and analysing this problem. This is definitely something we should improve!

Cheers,
Till
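To make the alternative mentioned in 2) concrete, here is a minimal sketch of gating submissions on recovery completion instead of delaying the leadership confirmation. It is not the actual Dispatcher implementation: the class name, the recoveryCompleted future, and the JobGraph/Acknowledge stand-ins are hypothetical placeholders.

import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: block job submissions until recovery has finished,
// instead of delaying the leadership confirmation itself.
public class RecoveringDispatcher {

    // Completed once all persisted JobGraphs have been recovered and re-submitted.
    private final CompletableFuture<Void> recoveryCompleted = new CompletableFuture<>();

    public void grantLeadership() {
        // Confirm leadership immediately so the REST endpoints become available...
        confirmLeadership();
        // ...and recover the persisted jobs asynchronously.
        CompletableFuture.runAsync(this::recoverPersistedJobs)
                .whenComplete((ignored, failure) -> {
                    if (failure != null) {
                        recoveryCompleted.completeExceptionally(failure);
                    } else {
                        recoveryCompleted.complete(null);
                    }
                });
    }

    public CompletableFuture<Acknowledge> submitJob(JobGraph jobGraph) {
        // External submissions are queued behind the recovery so that a client
        // retry cannot race ahead of a job that is about to be recovered.
        return recoveryCompleted.thenCompose(ignored -> internalSubmitJob(jobGraph));
    }

    // --- hypothetical placeholders ---
    private void confirmLeadership() { /* write leader information to ZooKeeper */ }
    private void recoverPersistedJobs() { /* read JobGraphs from HA storage and submit them */ }
    private CompletableFuture<Acknowledge> internalSubmitJob(JobGraph jobGraph) {
        return CompletableFuture.completedFuture(Acknowledge.get());
    }

    // Minimal stand-ins so the sketch is self-contained (not Flink's real classes).
    static final class JobGraph {}
    static final class Acknowledge {
        private static final Acknowledge INSTANCE = new Acknowledge();
        static Acknowledge get() { return INSTANCE; }
    }
}

In this variant the REST endpoints could become available right after the leadership grant and only submissions would wait; as said above, every other state-changing RPC would need a similar guard so that it cannot interfere with the ongoing recovery.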
On Fri, Aug 3, 2018 at 5:48 AM vino yang <yanghua1...@gmail.com> wrote:

> Hi Joey,
>
> Good question!
> I will copy it to Till and Chesnay who know this part of the implementation.
>
> Thanks, vino.
>
> 2018-08-03 11:09 GMT+08:00 Joey Echeverria <jechever...@splunk.com>:
>
>> I don’t have logs available yet, but I do have some information from ZK.
>>
>> The culprit appears to be the /flink/default/leader/dispatcher_lock znode.
>>
>> I took a look at the dispatcher code here:
>> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L762-L785
>>
>> And it looks to me that when leadership is granted it will perform job recovery on all jobs before it writes the new leader information to the /flink/default/leader/dispatcher_lock znode.
>>
>> So this leaves me with three questions:
>>
>> 1) Why does the web monitor specifically have to wait for the dispatcher?
>> 2) Is there a reason why the dispatcher can’t write the lock until after job recovery?
>> 3) Is there anything I can/should be doing to speed up job recovery?
>>
>> Thanks!
>>
>> -Joey
>>
>> On Aug 2, 2018, at 9:24 AM, Joey Echeverria <jechever...@splunk.com> wrote:
>>
>> Thanks for the tips Gary and Vino. I’ll try to reproduce it with test data and see if I can post some logs.
>>
>> I’ll also watch the leader znode to see if the election isn’t happening or if it’s not being retrieved.
>>
>> Thanks!
>>
>> -Joey
>>
>> On Aug 1, 2018, at 11:19 PM, Gary Yao <g...@data-artisans.com> wrote:
>>
>> Hi Joey,
>>
>> If the other components (e.g., Dispatcher, ResourceManager) are able to finish the leader election in a timely manner, I currently do not see a reason why it should take the REST server 20 - 45 minutes.
>>
>> You can check the contents of znode /flink/.../leader/rest_server_lock to see if there is indeed no leader, or if the leader information cannot be retrieved from ZooKeeper.
>>
>> If you can reproduce this in a staging environment with some test jobs, I'd like to see the ClusterEntrypoint/JobManager logs (perhaps on debug level).
>>
>> Best,
>> Gary
>>
>> On Mon, Jul 30, 2018 at 8:10 PM, Joey Echeverria <jechever...@splunk.com> wrote:
>>
>>> I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a single Job Manager running. I’m using Zookeeper to store the fencing/leader information and S3 to store the job manager state. We’ve been running around 250 or so streaming jobs and we’ve noticed that if the job manager pod is deleted, it takes something like 20-45 minutes for the job manager’s REST endpoints and web UI to become available. Until it becomes available, we get a 503 response from the HTTP server with the message "Could not retrieve the redirect address of the current leader. Please try to refresh."
>>>
>>> Has anyone else run into this?
>>>
>>> Are there any configuration settings I should be looking at to speed up the availability of the HTTP endpoints?
>>>
>>> Thanks!
>>>
>>> -Joey
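Regarding Gary's suggestion in the quoted thread to inspect the leader znode, here is a small Curator-based sketch for checking whether leader information has been written. The connect string is a placeholder, and the znode path is an assumption that mirrors the /flink/default/leader/dispatcher_lock path Joey mentioned; adjust both to your HA configuration.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.data.Stat;

// Hypothetical helper: report whether the rest_server_lock znode exists and
// whether it contains leader information. Path and connect string are assumptions.
public class CheckRestServerLeader {
    public static void main(String[] args) throws Exception {
        String connectString = "zookeeper:2181";                 // assumption
        String path = "/flink/default/leader/rest_server_lock";  // assumption, see note above

        CuratorFramework client = CuratorFrameworkFactory.newClient(
                connectString, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            Stat stat = client.checkExists().forPath(path);
            if (stat == null) {
                System.out.println("znode missing -> leadership has not been confirmed yet");
            } else {
                byte[] data = client.getData().forPath(path);
                // Flink stores the leader address and session id here (Java-serialized),
                // so we only report that leader information is present.
                System.out.println("znode exists with " + data.length + " bytes of leader information");
            }
        } finally {
            client.close();
        }
    }
}

As Gary notes, this distinguishes the case where no leader was elected from the case where leader information exists but cannot be retrieved from ZooKeeper.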