Hi All,

Desc

We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use ZooKeeper as the HighAvailabilityService. We found that a Flink job restarts whenever the network connection between the JobManager and ZooKeeper is temporarily interrupted, so we analyzed the problem in depth.

The Flink JobManager uses Curator's `LeaderLatch` to maintain leadership. When the network disconnects, the `LeaderLatch` sets leadership to false immediately. We think this is too harsh: many long-running Flink jobs restart because of a brief network glitch. Instead of revoking leadership as soon as the ZooKeeper connection is SUSPENDED, it would be better to wait until the ZooKeeper connection is LOST.
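To make the proposed behavior concrete, here is a minimal sketch of the idea. It is not Curator's or Flink's actual code: `ConnState` is a hypothetical stand-in that only borrows the constant names of Curator's `ConnectionState`, and `TolerantLeaderLatch` is a hypothetical class that models nothing but the leadership flag, so the example compiles without any dependencies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Curator's ConnectionState (same constant names),
// so that this sketch does not need the Curator dependency.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST, READ_ONLY }

// Hypothetical latch that models only the leadership flag.
// It keeps leadership on SUSPENDED and revokes it only on LOST.
class TolerantLeaderLatch {
    private boolean hasLeadership;
    private final List<String> events = new ArrayList<>();

    TolerantLeaderLatch(boolean initiallyLeader) {
        this.hasLeadership = initiallyLeader;
    }

    synchronized void handleStateChange(ConnState newState) {
        switch (newState) {
            case SUSPENDED:
                // The connection dropped, but the ZooKeeper session may still
                // be alive, so leadership is kept for now.
                events.add("suspended-ignored");
                break;
            case LOST:
                // The connection is considered lost: give up leadership.
                if (hasLeadership) {
                    hasLeadership = false;
                    events.add("revoked");
                }
                break;
            case RECONNECTED:
                // A real latch would re-check its state in ZooKeeper here;
                // this sketch only records the event.
                events.add("reconnected");
                break;
            default:
                // CONNECTED and READ_ONLY are not relevant to this sketch.
                break;
        }
    }

    synchronized boolean hasLeadership() {
        return hasLeadership;
    }

    // Returns a copy so callers cannot modify the internal event log.
    synchronized List<String> events() {
        return new ArrayList<>(events);
    }
}
```

With this handling, a SUSPENDED followed by RECONNECTED leaves the leader in place, and only a LOST event revokes leadership and would cause the job to restart.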
There are two Jira issues about this problem, FLINK-10052 and FLINK-13189, and they are duplicates. Thanks to @Elias Levy for pointing this out; FLINK-13189 has been closed.

Solution

There are currently two ways to solve this problem: one is to rewrite the LeaderLatch#handleStateChange method, the other is to upgrade to curator-4.2.0. The first way is hacky but works; the second needs compatibility to be considered. For more detail, please see FLINK-10052.

Hope

FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope this problem can be fixed as soon as possible. By the way, thanks to @TisonKun for discussing this problem and reviewing the PR.

Links

FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189

Any suggestions are welcome. What do you think?

Best,
lamber-ken.