Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections

Till Rohrmann Mon, 29 Jul 2019 06:12:55 -0700

Hi Tison,

I would consider this a new feature, and as such it won't be possible to
include it in the 1.9.0 release since the feature freeze has already passed.
We might target 1.10, though.


Cheers,
Till

On Mon, Jul 29, 2019 at 3:01 AM Zili Chen <[email protected]> wrote:

> Hi committers,
>
> Now that we have an ongoing PR [1] for this JIRA issue, we need a committer
> to push this thread forward. It would be great to see this issue fixed
> in 1.9.0.
>
> Best,
> tison.
>
> [1] https://github.com/apache/flink/pull/9158
>
>
> 未来阳光 <[email protected]> wrote on Tue, Jul 23, 2019 at 9:28 PM:
>
> > OK, if you have any suggestions, we can talk about the details under
> > FLINK-10052.
> >
> >
> > Best.
> >
> >
> > ------------------ Original Message ------------------
> > From: "Till Rohrmann"<[email protected]>;
> > Sent: Tuesday, July 23, 2019, 9:19 PM
> > To: "dev"<[email protected]>;
> >
> > Subject: Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections
> >
> >
> >
> > Hi Lamber-Ken,
> >
> > thanks for starting this discussion. I think there is a benefit in not
> > immediately losing leadership if the ZooKeeper connection goes into the
> > SUSPENDED state. In particular, if we can guarantee that there is only a
> > single JobMaster, it might make sense not to give up leadership too
> > eagerly. I suggest continuing the technical discussion on the JIRA issue
> > thread, since it already contains a good amount of detail.
> >
> > Cheers,
> > Till
> >
> > On Sat, Jul 20, 2019 at 12:55 PM QQ邮箱 <[email protected]> wrote:
> >
> > > Hi All,
> > >
> > > Desc
> > > We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use
> > > ZooKeeper as the high-availability service, but we found that a Flink job
> > > restarts when the network between the JobManager and ZooKeeper is
> > > temporarily disconnected. So we analyzed this problem in depth. The Flink
> > > JobManager uses Curator's `LeaderLatch` to maintain leadership. When the
> > > network disconnects, the `LeaderLatch` immediately sets leadership to
> > > false. We think this is too drastic, because many long-running Flink jobs
> > > restart on a brief network glitch. Instead of directly revoking the
> > > leadership upon a SUSPENDED ZooKeeper connection, it would be better to
> > > wait until the ZooKeeper connection is LOST.
> > >
> > > There are two JIRA issues about this problem, FLINK-10052 and FLINK-13189;
> > > they are duplicates. Thanks to @Elias Levy for pointing this out, so we
> > > closed FLINK-13189.
> > >
> > > Solution
> > > Back to this problem, there are currently two ways to solve it: one is to
> > > rewrite the LeaderLatch#handleStateChange method, the other is to upgrade
> > > to curator-4.2.0. The first way is hacky but correct; the second way needs
> > > to take compatibility into account. For more detail, please see
> > > FLINK-10052.
> > >
> > > Hope
> > > FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope
> > > this problem can be fixed as soon as possible.
> > > By the way, thanks to @TisonKun for discussing this problem and reviewing
> > > the PR.
> > >
> > > Links
> > > FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
> > > FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189
> > >
> > > Any suggestions are welcome. What do you think?
> > >
> > > Best, lamber-ken.
>
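
For reference, the behaviour proposed in the quoted message (keep leadership
while the connection is SUSPENDED, revoke it only once the connection is LOST)
can be sketched roughly as below. This is a minimal, self-contained model and
not Curator's or Flink's actual code: `ZkState` only mirrors a subset of the
values of Curator's `ConnectionState` enum, and `LeadershipTracker`,
`tolerateSuspended`, and `SuspendedToleranceDemo` are hypothetical names
introduced for illustration. Re-election after a reconnect is out of scope.

```java
import java.util.ArrayList;
import java.util.List;

/** Assumption: a local stand-in mirroring a subset of Curator's ConnectionState values. */
enum ZkState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/** Hypothetical helper that decides when a connection state change revokes leadership. */
final class LeadershipTracker {
    private final boolean tolerateSuspended;
    private final List<String> events = new ArrayList<>();
    private boolean leader;

    LeadershipTracker(boolean tolerateSuspended) {
        this.tolerateSuspended = tolerateSuspended;
    }

    void grantLeadership() {
        leader = true;
        events.add("granted");
    }

    /** Called for every connection state change reported by the ZooKeeper client. */
    void onStateChanged(ZkState newState) {
        // LOST always revokes leadership; SUSPENDED revokes it only in the eager mode.
        boolean revoke = newState == ZkState.LOST
                || (newState == ZkState.SUSPENDED && !tolerateSuspended);
        if (revoke && leader) {
            leader = false;
            events.add("revoked on " + newState);
        }
    }

    boolean isLeader() {
        return leader;
    }

    List<String> events() {
        return events;
    }
}

class SuspendedToleranceDemo {
    public static void main(String[] args) {
        // Eager mode: a short network glitch already costs the leadership.
        LeadershipTracker eager = new LeadershipTracker(false);
        eager.grantLeadership();
        eager.onStateChanged(ZkState.SUSPENDED);
        System.out.println("eager events: " + eager.events());

        // Tolerant mode: leadership survives SUSPENDED and is given up on LOST.
        LeadershipTracker tolerant = new LeadershipTracker(true);
        tolerant.grantLeadership();
        tolerant.onStateChanged(ZkState.SUSPENDED);
        tolerant.onStateChanged(ZkState.RECONNECTED);
        tolerant.onStateChanged(ZkState.LOST);
        System.out.println("tolerant events: " + tolerant.events());
    }
}
```

The trade-off touched on in the thread still applies to such a policy: holding
on to leadership during SUSPENDED is only safe if no second JobMaster can
become active in the meantime.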
