Hi Joey,

Good question! I will copy it to Till and Chesnay who know this part of the implementation.
Thanks, vino.

2018-08-03 11:09 GMT+08:00 Joey Echeverria <jechever...@splunk.com>:

> I don’t have logs available yet, but I do have some information from ZK.
>
> The culprit appears to be the /flink/default/leader/dispatcher_lock znode.
>
> I took a look at the dispatcher code here:
> https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L762-L785
>
> And it looks to me that when leadership is granted it will perform job
> recovery on all jobs before it writes the new leader information to
> the /flink/default/leader/dispatcher_lock znode.
>
> So this leaves me with three questions:
>
> 1) Why does the web monitor specifically have to wait for the dispatcher?
> 2) Is there a reason why the dispatcher can’t write the lock until after
> job recovery?
> 3) Is there anything I can/should be doing to speed up job recovery?
>
> Thanks!
>
> -Joey
>
>
> On Aug 2, 2018, at 9:24 AM, Joey Echeverria <jechever...@splunk.com> wrote:
>
> Thanks for the tips Gary and Vino. I’ll try to reproduce it with test data
> and see if I can post some logs.
>
> I’ll also watch the leader znode to see if the election isn’t happening or
> if it’s not being retrieved.
>
> Thanks!
>
> -Joey
>
> On Aug 1, 2018, at 11:19 PM, Gary Yao <g...@data-artisans.com> wrote:
>
> Hi Joey,
>
> If the other components (e.g., Dispatcher, ResourceManager) are able to
> finish the leader election in a timely manner, I currently do not see a
> reason why it should take the REST server 20 - 45 minutes.
>
> You can check the contents of znode /flink/.../leader/rest_server_lock to
> see if there is indeed no leader, or if the leader information cannot be
> retrieved from ZooKeeper.
>
> If you can reproduce this in a staging environment with some test jobs, I'd
> like to see the ClusterEntrypoint/JobManager logs (perhaps on debug level).
>
> Best,
> Gary
>
> On Mon, Jul 30, 2018 at 8:10 PM, Joey Echeverria <jechever...@splunk.com>
> wrote:
>
>> I’m running Flink 1.5.0 in Kubernetes with HA enabled, but only a single
>> Job Manager running. I’m using ZooKeeper to store the fencing/leader
>> information and S3 to store the job manager state. We’ve been running
>> around 250 or so streaming jobs and we’ve noticed that if the job manager
>> pod is deleted, it takes something like 20-45 minutes for the job manager’s
>> REST endpoints and web UI to become available. Until it becomes available,
>> we get a 503 response from the HTTP server with the message "Could not
>> retrieve the redirect address of the current leader. Please try to
>> refresh.”
>>
>> Has anyone else run into this?
>>
>> Are there any configuration settings I should be looking at to speed up
>> the availability of the HTTP endpoints?
>>
>> Thanks!
>>
>> -Joey
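For anyone who wants to look at those leader znodes directly, here is a minimal
sketch using the plain ZooKeeper Java client. It assumes the default
high-availability.zookeeper.path.root (/flink) and the default cluster-id, and
"zk-host:2181" is a placeholder for the actual quorum. It only checks whether
the znodes exist and hold a payload; it does not decode the serialized leader
information Flink stores there.

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Sketch: check whether Flink's HA leader znodes exist and carry a payload.
// Paths assume the default ZooKeeper root (/flink) and cluster-id (default);
// adjust them to match your high-availability.zookeeper.* settings.
public class LeaderZnodeCheck {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> {});
        try {
            String[] paths = {
                "/flink/default/leader/rest_server_lock",
                "/flink/default/leader/dispatcher_lock"
            };
            for (String path : paths) {
                Stat stat = zk.exists(path, false);
                if (stat == null) {
                    System.out.println(path + ": znode does not exist (no leader published yet)");
                } else {
                    byte[] data = zk.getData(path, false, stat);
                    int len = (data == null) ? 0 : data.length;
                    System.out.println(path + ": " + len + " bytes of leader information");
                }
            }
        } finally {
            zk.close();
        }
    }
}

The same check can be done interactively with ZooKeeper's zkCli.sh, e.g.
"get /flink/default/leader/rest_server_lock", which is usually quicker when
debugging a live cluster.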