Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections

Zili Chen Sun, 28 Jul 2019 18:01:40 -0700

Hi committers,

Now that we have an open PR [1] for this JIRA issue, we need a committer
to push this thread forward. It would be great to see this issue fixed
in 1.9.0.


Best,
tison.

[1] https://github.com/apache/flink/pull/9158


未来阳光 <[email protected]> wrote on Tue, Jul 23, 2019 at 9:28 PM:

> OK. If you have any suggestions, we can discuss the details under
> FLINK-10052.
>
>
> Best.
>
>
> ------------------ Original Message ------------------
> From: "Till Rohrmann"<[email protected]>;
> Sent: Tuesday, July 23, 2019, 9:19 PM
> To: "dev"<[email protected]>;
>
> Subject: Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections
>
>
>
> Hi Lamber-Ken,
>
> thanks for starting this discussion. I think there is a benefit in not
> immediately losing leadership when the ZooKeeper connection goes into the
> SUSPENDED state. In particular, if we can guarantee that there is only a
> single JobMaster, it might make sense not to give up leadership too
> eagerly. I would suggest continuing the technical discussion on the
> JIRA issue thread, since it already contains a good amount of detail.
>
> Cheers,
> Till
>
> On Sat, Jul 20, 2019 at 12:55 PM QQ邮箱 <[email protected]> wrote:
>
> > Hi All,
> >
> > Desc
> > We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use
> > ZooKeeper as the HighAvailabilityService, but we found that a Flink job
> > restarts whenever the network between the JobManager and ZooKeeper is
> > temporarily disconnected, so we analyzed the problem in depth. The Flink
> > JobManager uses Curator's `LeaderLatch` to maintain leadership. When the
> > network disconnects, the `LeaderLatch` sets leadership to false
> > immediately. We think this is too aggressive, because many long-running
> > Flink jobs restart on a brief network glitch. Instead of directly revoking
> > leadership upon a SUSPENDED ZooKeeper connection, it would be better to
> > wait until the ZooKeeper connection is LOST.
> >
> > There are two JIRA issues about this problem, FLINK-10052 and FLINK-13189,
> > and they are duplicates. Thanks to @Elias Levy for pointing this out; we
> > have closed FLINK-13189.
> >
> > Solution
> > There are currently two ways to solve this problem: one is to rewrite the
> > LeaderLatch#handleStateChange method, the other is to upgrade to Curator
> > 4.2.0. The first is hacky but correct; the second requires considering
> > compatibility. For more details, please see FLINK-10052.
> >
> > Hope
> > FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope
> > this problem can be fixed as soon as possible.
> > By the way, thanks to @TisonKun for discussing this problem and reviewing
> > the PR.
> >
> > Links
> > FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
> > FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189
> >
> > Any suggestions are welcome. What do you think?
> >
> > Best, lamber-ken.
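
For readers following the thread, the behaviour proposed in the quoted
message (keep leadership while the ZooKeeper connection is SUSPENDED, revoke
it only once the connection is LOST) can be sketched roughly as below. This
is a minimal illustration only, not the code from the PR and not Curator's
LeaderLatch: the class SuspendTolerantLeadershipSketch, the ConnState enum
(a stand-in for the connection states named in the thread) and the
grantLeadership() hook are all hypothetical names, and no ZooKeeper client
is involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not Curator's or Flink's actual implementation. */
final class SuspendTolerantLeadershipSketch {

    /** Stand-in for the connection states discussed in this thread. */
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private boolean leader;
    private final List<String> events = new ArrayList<>();

    /** Assumed hook: called when the election grants leadership. */
    void grantLeadership() {
        if (!leader) {
            leader = true;
            events.add("granted");
        }
    }

    /** Reacts to a connection state change reported by the client. */
    void handleStateChange(ConnState newState) {
        switch (newState) {
            case SUSPENDED:
                // Tolerate a temporary disconnect: keep leadership for now.
                if (leader) {
                    events.add("suspended-kept");
                }
                break;
            case LOST:
                // The connection is considered gone: give up leadership.
                revokeLeadership();
                break;
            case CONNECTED:
            case RECONNECTED:
                // Nothing to revoke here. A real implementation would
                // re-check its election node after a reconnect.
                break;
        }
    }

    private void revokeLeadership() {
        if (leader) {
            leader = false;
            events.add("revoked");
        }
    }

    boolean hasLeadership() {
        return leader;
    }

    List<String> events() {
        return events;
    }

    public static void main(String[] args) {
        SuspendTolerantLeadershipSketch sketch = new SuspendTolerantLeadershipSketch();
        sketch.grantLeadership();
        sketch.handleStateChange(ConnState.SUSPENDED);   // short network glitch
        sketch.handleStateChange(ConnState.RECONNECTED); // connection is back
        sketch.handleStateChange(ConnState.LOST);        // connection given up
        System.out.println(sketch.events());
    }
}
```

As Till notes in his reply, holding on to leadership during SUSPENDED is
only reasonable if it can be guaranteed that there is a single JobMaster;
the sketch does not address that guarantee.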
