Nice topic. Our Flink jobs have hit this problem too, and I think this work can help us deal with it.
On 2019/07/20 10:55:23, QQ邮箱 <2...@qq.com> wrote:
> Hi all,
>
> Desc
>
> We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use
> ZooKeeper as the HighAvailabilityService. We found that a Flink job restarts
> whenever the network connection between the JobManager and ZooKeeper is
> temporarily interrupted, so we analyzed the problem in depth. The Flink
> JobManager uses Curator's `LeaderLatch` to maintain leadership. When the
> network disconnects, `LeaderLatch` immediately sets leadership to false. We
> think this is too aggressive, because many long-running Flink jobs restart
> after a brief network glitch. Instead of revoking leadership as soon as the
> ZooKeeper connection is SUSPENDED, it would be better to wait until the
> connection is LOST.
>
> There are two JIRA issues about this problem, FLINK-10052 and FLINK-13189,
> and they are duplicates. Thanks to @Elias Levy for pointing this out.
> FLINK-13189 has been closed.
>
> Solution
>
> There are currently two ways to solve this: one is to rewrite the
> LeaderLatch#handleStateChange method, the other is to upgrade to Curator
> 4.2.0. The first is hacky but correct, while the second requires checking
> compatibility. For more detail, please see FLINK-10052.
>
> Hope
>
> FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope this
> problem can be fixed as soon as possible.
>
> By the way, thanks to @TisonKun for discussing this problem and reviewing
> the PR.
>
> Links
>
> FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
> FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189
>
> Any suggestions are welcome. What do you think?
>
> Best,
> lamber-ken
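
For anyone skimming the thread, here is a minimal sketch of the behaviour proposed in the quoted message: keep leadership while the connection is SUSPENDED and revoke it only once it is LOST. This is not the Curator or Flink code. `ConnState` and `TolerantLatch` are hypothetical stand-ins I made up for illustration (the enum only mirrors the names of Curator's connection states), and the actual patch is the one discussed in FLINK-10052.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class LeadershipSketch {

    /** Hypothetical stand-in that mirrors the names of Curator's connection states. */
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Hypothetical latch that tracks only the leadership flag (assumed to start as leader). */
    static class TolerantLatch {
        private final AtomicBoolean hasLeadership = new AtomicBoolean(true);

        boolean hasLeadership() {
            return hasLeadership.get();
        }

        void handleStateChange(ConnState newState) {
            switch (newState) {
                case SUSPENDED:
                    // Temporary disconnect: the ZooKeeper session may still be
                    // alive, so keep leadership instead of revoking it.
                    break;
                case LOST:
                    // Session is gone: leadership can no longer be assumed.
                    hasLeadership.set(false);
                    break;
                case RECONNECTED:
                    // A real latch would re-check its ZooKeeper node here;
                    // that step is omitted in this sketch.
                    break;
                default:
                    break;
            }
        }
    }

    public static void main(String[] args) {
        TolerantLatch latch = new TolerantLatch();

        latch.handleStateChange(ConnState.SUSPENDED);
        System.out.println("after SUSPENDED: " + latch.hasLeadership()); // true

        latch.handleStateChange(ConnState.LOST);
        System.out.println("after LOST: " + latch.hasLeadership()); // false
    }
}
```

The trade-off, as far as I understand it, is that a leader which stays leader during SUSPENDED relies on the session timeout to bound how long it may act on stale leadership, so the timeout setting matters more with this approach.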