Hi Till,

Thanks for your explanation. Let's pick this thread up again during 1.10
development.

Best,
tison.

On Mon, Jul 29, 2019 at 9:12 PM Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Tison,
>
> I would consider this a new feature, and as such it won't be possible to
> include it in the 1.9.0 release since the feature freeze has passed.
> We might target 1.10, though.
>
> Cheers,
> Till
>
> On Mon, Jul 29, 2019 at 3:01 AM Zili Chen <wander4...@gmail.com> wrote:
>
> > Hi committers,
> >
> > Now that we have an open PR [1] for this JIRA issue, we need a committer
> > to push this thread forward. It would be great to see this issue fixed
> > in 1.9.0.
> >
> > Best,
> > tison.
> >
> > [1] https://github.com/apache/flink/pull/9158
> >
> > On Tue, Jul 23, 2019 at 9:28 PM 未来阳光 <2217232...@qq.com> wrote:
> >
> > > OK. If you have any suggestions, we can talk about the details under
> > > FLINK-10052.
> > >
> > > Best.
> > >
> > > ------------------ Original Message ------------------
> > > From: "Till Rohrmann" <trohrm...@apache.org>
> > > Sent: Tuesday, July 23, 2019, 9:19 PM
> > > To: "dev" <dev@flink.apache.org>
> > > Subject: Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper
> > > connections
> > >
> > > Hi Lamber-Ken,
> > >
> > > Thanks for starting this discussion. I think there is a benefit to not
> > > losing leadership immediately when the ZooKeeper connection goes into
> > > the SUSPENDED state. In particular, if we can guarantee that there is
> > > only a single JobMaster, it might make sense not to give up leadership
> > > too eagerly. I would suggest continuing the technical discussion on the
> > > JIRA issue thread, since it already contains a good amount of detail.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Sat, Jul 20, 2019 at 12:55 PM QQ邮箱 <2217232...@qq.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > Desc
> > > > We deploy Flink streaming jobs on a Hadoop cluster in per-job mode
> > > > and use ZooKeeper as the HighAvailabilityService, but we found that
> > > > a Flink job restarts whenever the network between the JobManager and
> > > > ZooKeeper is temporarily disconnected. So we analyzed this problem in
> > > > depth. The Flink JobManager uses Curator's `LeaderLatch` to maintain
> > > > leadership. When the network disconnects, the `LeaderLatch` sets
> > > > leadership to false immediately. We think it is too harsh that many
> > > > long-running Flink jobs restart because of a brief network glitch.
> > > > Instead of revoking leadership immediately upon a SUSPENDED ZooKeeper
> > > > connection, it would be better to wait until the ZooKeeper connection
> > > > is LOST.
> > > >
> > > > There are two JIRA issues about this problem, FLINK-10052 and
> > > > FLINK-13189, and they are duplicates. Thanks to @Elias Levy for
> > > > pointing out the duplicate; we have closed FLINK-13189.
> > > >
> > > > Solution
> > > > Back to the problem: there are currently two ways to solve it. One
> > > > is to rewrite the LeaderLatch#handleStateChange method; the other is
> > > > to upgrade to curator-4.2.0. The first way is hacky but correct; the
> > > > second needs to take compatibility into account. For more details,
> > > > please see FLINK-10052.
> > > >
> > > > Hope
> > > > FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope
> > > > this problem can be fixed as soon as possible.
> > > > By the way, thanks to @TisonKun for discussing this problem and
> > > > reviewing the PR.
> > > >
> > > > Links
> > > > FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
> > > > FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189
> > > >
> > > > Any suggestions are welcome. What do you think?
> > > >
> > > > Best,
> > > > lamber-ken.
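
For readers following the thread, the difference between revoking leadership on SUSPENDED and waiting until LOST can be sketched roughly as below. This is a minimal, self-contained model under stated assumptions, not Curator's `LeaderLatch` and not the change proposed in the PR: `LeadershipModel`, `ConnState`, and the `revokeOnSuspended` flag are hypothetical names, and re-election after a reconnect is simplified to "the same contender wins again".

```java
// Hypothetical model for illustration only; this is not Curator's LeaderLatch
// and not Flink code. All names here are assumptions.
final class LeadershipModel {

    // Local enum that mirrors the connection state names used in the thread.
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final boolean revokeOnSuspended;
    private boolean leader = true;   // assume this contender currently leads
    private int revocations = 0;     // each revocation stands for a job restart

    LeadershipModel(boolean revokeOnSuspended) {
        this.revokeOnSuspended = revokeOnSuspended;
    }

    void onStateChange(ConnState state) {
        switch (state) {
            case SUSPENDED:
                // Eager behavior: give up leadership on a suspended connection.
                if (revokeOnSuspended) {
                    revoke();
                }
                break;
            case LOST:
                // Both variants give up leadership once the connection is lost.
                revoke();
                break;
            case RECONNECTED:
                // Simplification: assume the contender wins the re-election.
                leader = true;
                break;
            default:
                break;
        }
    }

    private void revoke() {
        if (leader) {
            leader = false;
            revocations++;
        }
    }

    boolean isLeader() {
        return leader;
    }

    int revocations() {
        return revocations;
    }

    public static void main(String[] args) {
        // A short network glitch: the connection is suspended, then restored.
        ConnState[] glitch = { ConnState.SUSPENDED, ConnState.RECONNECTED };
        LeadershipModel eager = new LeadershipModel(true);
        LeadershipModel tolerant = new LeadershipModel(false);
        for (ConnState s : glitch) {
            eager.onStateChange(s);
            tolerant.onStateChange(s);
        }
        System.out.println("eager revocations: " + eager.revocations());
        System.out.println("tolerant revocations: " + tolerant.revocations());
    }
}
```

In this model, a SUSPENDED followed by RECONNECTED costs the eager variant one revocation (one job restart in the scenario described above), while the tolerant variant keeps leadership throughout; both still give up leadership on LOST.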