Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections

Zili Chen Mon, 29 Jul 2019 08:07:30 -0700

Hi Till,

Thanks for your explanation. Let's pick up this thread during 1.10 development.


Best,
tison.


On Mon, Jul 29, 2019 at 9:12 PM Till Rohrmann <[email protected]> wrote:

> Hi Tison,
>
> I would consider this a new feature, and as such it won't be possible to
> include it in the 1.9.0 release since the feature freeze has already passed.
> We might target 1.10, though.
>
> Cheers,
> Till
>
> On Mon, Jul 29, 2019 at 3:01 AM Zili Chen <[email protected]> wrote:
>
> > Hi committers,
> >
> > Now that we have an open PR [1] for this JIRA issue, we need a committer
> > to push this thread forward. It would be great to see this issue fixed
> > in 1.9.0.
> >
> > Best,
> > tison.
> >
> > [1] https://github.com/apache/flink/pull/9158
> >
> >
> > On Tue, Jul 23, 2019 at 9:28 PM 未来阳光 <[email protected]> wrote:
> >
> > > OK. If you have any suggestions, we can talk about the details under
> > > FLINK-10052.
> > >
> > >
> > > Best.
> > >
> > >
> > > ------------------ Original Message ------------------
> > > From: "Till Rohrmann"<[email protected]>;
> > > Sent: Tuesday, July 23, 2019, 9:19 PM
> > > To: "dev"<[email protected]>;
> > >
> > > Subject: Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections
> > >
> > >
> > >
> > > Hi Lamber-Ken,
> > >
> > > thanks for starting this discussion. I think there is a benefit in not
> > > immediately losing leadership if the ZooKeeper connection goes into the
> > > SUSPENDED state. In particular, if we can guarantee that there is only a
> > > single JobMaster, it might make sense not to give up leadership too
> > > eagerly. I would suggest continuing the technical discussion on the
> > > JIRA issue thread since it already contains a good amount of detail.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Sat, Jul 20, 2019 at 12:55 PM QQ邮箱 <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Desc
> > > > We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and
> > > > use ZooKeeper as the HighAvailabilityService, but we found that a Flink
> > > > job restarts whenever the network between the JobManager and ZooKeeper
> > > > is temporarily disconnected. So we analyzed this problem in depth. The
> > > > Flink JobManager uses Curator's `LeaderLatch` to maintain leadership.
> > > > When the network disconnects, the `LeaderLatch` immediately sets
> > > > leadership to false. We think it is too drastic for many long-running
> > > > Flink jobs to restart because of a brief network glitch. Instead of
> > > > directly revoking leadership upon a SUSPENDED ZooKeeper connection, it
> > > > would be better to wait until the ZooKeeper connection is LOST.
> > > >
> > > > There are two JIRA issues about this problem, FLINK-10052 and
> > > > FLINK-13189; they are duplicates. Thanks to @Elias Levy for pointing
> > > > this out, so we are closing FLINK-13189.
> > > >
> > > > Solution
> > > > Back to this problem: there are currently two ways to solve it. One is
> > > > to rewrite the LeaderLatch#handleStateChange method; the other is to
> > > > upgrade to curator-4.2.0. The first way is hacky but works; the second
> > > > requires considering compatibility. For more details, please see
> > > > FLINK-10052.
> > > >
> > > > Hope
> > > > FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope
> > > > this problem can be fixed as soon as possible.
> > > > By the way, thanks to @TisonKun for discussing this problem and
> > > > reviewing the PR.
> > > >
> > > > Links
> > > > FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
> > > > FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189
> > > >
> > > > Any suggestions are welcome. What do you think?
> > > >
> > > > Best, lamber-ken.
> >
>
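
For readers of this thread, the behaviour proposed in the original message (keep leadership while the ZooKeeper connection is SUSPENDED and revoke it only once the connection is LOST) can be sketched roughly as below. This is only an illustration under stated assumptions: `ConnState` and `SuspendTolerantLeadership` are hypothetical names invented for this sketch, and the code does not use or reproduce Curator's `LeaderLatch` or any Flink class.

```java
// Hypothetical sketch: these types only mimic the SUSPENDED/LOST distinction
// discussed in the thread; they are not Curator's or Flink's actual classes.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

final class SuspendTolerantLeadership {
    private boolean leader;

    SuspendTolerantLeadership(boolean initiallyLeader) {
        this.leader = initiallyLeader;
    }

    boolean isLeader() {
        return leader;
    }

    /** Applies a connection state change and returns the resulting leadership flag. */
    boolean onStateChanged(ConnState newState) {
        switch (newState) {
            case LOST:
                // Treated here as the connection being gone for good: step down.
                leader = false;
                break;
            case SUSPENDED:
                // Connection interrupted but possibly recoverable: keep
                // leadership instead of revoking it right away.
                break;
            case CONNECTED:
            case RECONNECTED:
                // No change. Regaining leadership after LOST would need a new
                // election, which this sketch does not model.
                break;
        }
        return leader;
    }

    public static void main(String[] args) {
        SuspendTolerantLeadership leadership = new SuspendTolerantLeadership(true);
        System.out.println("after SUSPENDED:   " + leadership.onStateChanged(ConnState.SUSPENDED));
        System.out.println("after RECONNECTED: " + leadership.onStateChanged(ConnState.RECONNECTED));
        System.out.println("after LOST:        " + leadership.onStateChanged(ConnState.LOST));
    }
}
```

In this sketch a short network interruption (SUSPENDED followed by RECONNECTED) leaves the leadership flag untouched, while LOST clears it. Whether that is safe in practice depends on the guarantee Till mentions, namely that only a single JobMaster can be active; the details are discussed under FLINK-10052.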
