Hi committers,

Now that there is an open PR [1] for this JIRA issue, we need a committer to push this thread forward. It would be great to see this issue fixed in 1.9.0.
Best,
tison.

[1] https://github.com/apache/flink/pull/9158

未来阳光 <2217232...@qq.com> wrote on Tue, Jul 23, 2019 at 9:28 PM:

> OK. If you have any suggestions, we can talk about the details under
> FLINK-10052.
>
> Best.
>
> ------------------ Original Message ------------------
> From: "Till Rohrmann"<trohrm...@apache.org>;
> Sent: Tuesday, July 23, 2019, 9:19 PM
> To: "dev"<dev@flink.apache.org>;
> Subject: Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections
>
> Hi Lamber-Ken,
>
> thanks for starting this discussion. I think there is a benefit to not
> directly losing leadership if the ZooKeeper connection goes into the
> SUSPENDED state. In particular, if we can guarantee that there is only a
> single JobMaster, it might make sense not to give up leadership overly
> eagerly. I would suggest continuing the technical discussion on the
> JIRA issue thread, since it already contains a good amount of detail.
>
> Cheers,
> Till
>
> On Sat, Jul 20, 2019 at 12:55 PM QQ邮箱 <2217232...@qq.com> wrote:
>
> > Hi All,
> >
> > Desc
> > We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and
> > use ZooKeeper as the HighAvailabilityService, but we found that a Flink
> > job restarts whenever the network between the JobManager and ZooKeeper
> > is temporarily disconnected. So we analyzed this problem in depth. The
> > Flink JobManager uses Curator's `LeaderLatch` to maintain leadership.
> > When the network disconnects, the `LeaderLatch` sets leadership to
> > false directly. We think it is too brutal that many long-running Flink
> > jobs restart because of a brief network glitch. Instead of directly
> > revoking leadership upon a SUSPENDED ZooKeeper connection, it would be
> > better to wait until the ZooKeeper connection is LOST.
> >
> > There are two JIRA issues about this problem, FLINK-10052 and
> > FLINK-13189; they are duplicates. Thanks to @Elias Levy for pointing
> > this out; FLINK-13189 has therefore been closed.
> >
> > Solution
> > Back to the problem: there are currently two ways to solve it. One is
> > to rewrite the LeaderLatch#handleStateChange method; the other is to
> > upgrade to curator-4.2.0. The first way is hacky but correct; the
> > second needs to take compatibility into account. For more detail,
> > please see FLINK-10052.
> >
> > Hope
> > FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope
> > this problem can be fixed as soon as possible.
> > BTW, thanks to @TisonKun for discussing this problem and reviewing the PR.
> >
> > Links
> > FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
> > FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189
> >
> > Any suggestion is welcome. What do you think?
> >
> > Best,
> > lamber-ken.
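
For readers skimming the thread, the difference between revoking leadership on SUSPENDED and waiting for LOST can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: `ConnState`, `Leadership` and `countRevocations` are hypothetical names made up for this sketch, the enum merely borrows the state names mentioned above, and none of it is Curator's `LeaderLatch` code or the change proposed in the PR. Re-election after a reconnect is deliberately not modeled.

```java
// Self-contained sketch; ConnState and Leadership are hypothetical stand-ins,
// not Curator's ConnectionState or LeaderLatch.
class LeadershipSketch {

    /** Local enum that only borrows the connection state names used in this thread. */
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Holds a leadership flag and counts how often it gets revoked. */
    static final class Leadership {
        private final boolean tolerateSuspended;
        private boolean leader = true; // assumption: leadership was already granted
        private int revocations = 0;

        Leadership(boolean tolerateSuspended) {
            this.tolerateSuspended = tolerateSuspended;
        }

        void onStateChange(ConnState state) {
            switch (state) {
                case SUSPENDED:
                    // Eager policy: give up leadership right away.
                    // Tolerant policy: keep it and wait for LOST or RECONNECTED.
                    if (!tolerateSuspended) {
                        revoke();
                    }
                    break;
                case LOST:
                    // Both policies give up leadership once the connection is LOST.
                    revoke();
                    break;
                default:
                    // CONNECTED / RECONNECTED: nothing to revoke in this sketch.
                    break;
            }
        }

        private void revoke() {
            if (leader) {
                leader = false;
                revocations++;
            }
        }

        boolean isLeader() {
            return leader;
        }

        int revocations() {
            return revocations;
        }
    }

    /** Replays a sequence of connection events and returns the number of revocations. */
    static int countRevocations(boolean tolerateSuspended, ConnState... events) {
        Leadership leadership = new Leadership(tolerateSuspended);
        for (ConnState event : events) {
            leadership.onStateChange(event);
        }
        return leadership.revocations();
    }

    public static void main(String[] args) {
        // A short network blip: the connection is suspended, then comes back.
        ConnState[] blip = { ConnState.SUSPENDED, ConnState.RECONNECTED };
        System.out.println("eager:    " + countRevocations(false, blip));
        System.out.println("tolerant: " + countRevocations(true, blip));
    }
}
```

For the short blip in `main`, the eager policy revokes leadership once (which in Flink's case would mean a job restart), while the tolerant policy does not revoke at all. If the sequence ends in LOST instead, both policies revoke exactly once.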