OK, if you have any suggestions, we can talk about the details under FLINK-10052.
Best.

------------------ Original Message ------------------
From: "Till Rohrmann"<trohrm...@apache.org>;
Sent: Tuesday, July 23, 2019, 9:19
To: "dev"<dev@flink.apache.org>;
Subject: Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections

Hi Lamber-Ken,

thanks for starting this discussion. I think there is a benefit to not directly losing leadership if the ZooKeeper connection goes into the SUSPENDED state. In particular, if we can guarantee that there is only a single JobMaster, it might make sense not to give up leadership overly eagerly.

I would suggest continuing the technical discussion on the JIRA issue thread, since it already contains a good amount of detail.

Cheers,
Till

On Sat, Jul 20, 2019 at 12:55 PM QQ Mail <2217232...@qq.com> wrote:

> Hi All,
>
> Desc
> We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use
> ZooKeeper as the HighAvailabilityService, but we found that a Flink job
> restarts when the network between the JobManager and ZooKeeper is
> temporarily disconnected, so we analyzed this problem in depth. The Flink
> JobManager uses Curator's `LeaderLatch` to maintain leadership. When the
> network disconnects, the `LeaderLatch` sets leadership to false
> immediately. We think it is too harsh that many long-running Flink jobs
> restart because of a brief network glitch. Instead of directly revoking
> leadership upon a SUSPENDED ZooKeeper connection, it would be better to
> wait until the ZooKeeper connection is LOST.
>
> There are two JIRA issues about this problem, FLINK-10052 and FLINK-13189,
> and they are duplicates. Thanks to @Elias Levy for pointing this out; we
> have closed FLINK-13189.
>
> Solution
> Back to this problem: there are currently two ways to solve it. One is to
> rewrite the LeaderLatch#handleStateChange method; the other is to upgrade
> to curator-4.2.0. The first way is hacky but correct; the second needs to
> take compatibility into account. For more details, please see FLINK-10052.
>
> Hope
> FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope
> this problem can be fixed as soon as possible.
> By the way, thanks to @TisonKun for discussing this problem and reviewing
> the PR.
>
> Links
> FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
> FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189
>
> Any suggestions are welcome. What do you think?
>
> Best,
> lamber-ken.
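
To make the proposal in the quoted message concrete, here is a minimal sketch of the intended state handling: keep leadership while the connection is SUSPENDED and revoke it only once the connection is LOST. The class `SuspendTolerantLatch` and its methods are hypothetical names made up for illustration; this is a stand-alone model, not Curator's `LeaderLatch` and not the patch attached to FLINK-10052. Only the state names follow Curator's `ConnectionState` constants.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-alone model of a leader latch that tolerates SUSPENDED.
 * It is NOT Curator's LeaderLatch and NOT Flink code; it only illustrates the
 * state handling proposed in the thread.
 */
final class SuspendTolerantLatch {

    /** Same names as Curator's ConnectionState constants (READ_ONLY omitted). */
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private boolean hasLeadership;

    // Records each state that caused a revocation, i.e. each point at which
    // a job restart would be triggered.
    private final List<ConnectionState> revocations = new ArrayList<>();

    SuspendTolerantLatch(boolean initiallyLeader) {
        this.hasLeadership = initiallyLeader;
    }

    /** Reacts to a connection state change; only LOST revokes leadership. */
    void handleStateChange(ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                // Proposed behaviour: the connection may still recover,
                // so leadership is kept for now.
                break;
            case LOST:
                // The connection is considered gone, so another contender
                // may take over; give up leadership.
                if (hasLeadership) {
                    hasLeadership = false;
                    revocations.add(newState);
                }
                break;
            case CONNECTED:
            case RECONNECTED:
                // Nothing to do in this model; a real latch would re-check
                // its position in the election here.
                break;
        }
    }

    boolean hasLeadership() {
        return hasLeadership;
    }

    List<ConnectionState> revocations() {
        return revocations;
    }
}
```

With this model, a short network glitch (SUSPENDED followed by RECONNECTED) leaves leadership untouched, and only LOST records a revocation. As noted in the thread, holding on to leadership like this is only reasonable if it can be guaranteed that there is a single JobMaster.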