Nice topic. Our Flink jobs have hit this problem too, and I think this work can
help us deal with it.

On 2019/07/20 10:55:23, QQ邮箱 <2...@qq.com> wrote: 
> Hi All,
> 
> Description
> We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use
> ZooKeeper as the HighAvailabilityService, but we found that a Flink job
> restarts when the network between the JobManager and ZooKeeper is
> temporarily disconnected. So we analyzed this problem in depth. The Flink
> JobManager uses Curator's `LeaderLatch` to maintain leadership. When the
> network disconnects, the `LeaderLatch` sets leadership to false immediately.
> We think it is too drastic that many long-running Flink jobs restart because
> of a brief network glitch. Instead of directly revoking the leadership upon a
> SUSPENDED ZooKeeper connection, it would be better to wait until the
> ZooKeeper connection is LOST.
> 
> There are two JIRA issues about this problem, FLINK-10052 and FLINK-13189;
> they are duplicates. Thanks to @Elias Levy for pointing that out, so we are
> closing FLINK-13189.
> 
> Solution
> Back to the problem: there are currently two ways to solve it. One is to
> rewrite the LeaderLatch#handleStateChange method; the other is to upgrade to
> curator-4.2.0. The first is hacky but correct; the second requires us to
> consider compatibility. For more details, please see FLINK-10052.
> 
> Hope
> FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope this
> problem can be fixed as soon as possible.
> By the way, thanks to @TisonKun for discussing this problem and reviewing
> the PR.
> 
> Links
> FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
> FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189
> 
> Any suggestions are welcome. What do you think?
> 
> Best, lamber-ken.
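
To make the difference between the two behaviours in the quoted message concrete, here is a minimal, self-contained sketch. It is a toy model only, not Curator's or Flink's actual code: `ToyLatch`, `RevokePolicy`, and `countRevocations` are hypothetical names introduced for illustration, and the model assumes a single contender that regains leadership as soon as the connection is RECONNECTED.

```java
import java.util.Arrays;
import java.util.List;

public class LeadershipSketch {

    /** Connection states, using the names from the discussion above. */
    public enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Hypothetical policy: which connection states cost us the leadership? */
    public enum RevokePolicy {
        ON_SUSPENDED_OR_LOST, // behaviour described in the quoted message
        ON_LOST_ONLY;         // behaviour proposed in the quoted message

        boolean revokes(ConnectionState state) {
            if (state == ConnectionState.LOST) {
                return true;
            }
            return this == ON_SUSPENDED_OR_LOST && state == ConnectionState.SUSPENDED;
        }
    }

    /** Toy stand-in for a leader latch; starts out holding the leadership. */
    static final class ToyLatch {
        private final RevokePolicy policy;
        private boolean leader = true;
        private int revocations = 0;

        ToyLatch(RevokePolicy policy) {
            this.policy = policy;
        }

        void handleStateChange(ConnectionState newState) {
            if (leader && policy.revokes(newState)) {
                // Each revocation stands for one job restart in this model.
                leader = false;
                revocations++;
            } else if (!leader && newState == ConnectionState.RECONNECTED) {
                // Simplifying assumption: we are the only contender and win again.
                leader = true;
            }
        }

        int revocations() {
            return revocations;
        }
    }

    /** Replays a sequence of connection events and counts leadership losses. */
    public static int countRevocations(RevokePolicy policy, List<ConnectionState> events) {
        ToyLatch latch = new ToyLatch(policy);
        for (ConnectionState event : events) {
            latch.handleStateChange(event);
        }
        return latch.revocations();
    }

    public static void main(String[] args) {
        // A short network blip: the session survives, so LOST is never reported.
        List<ConnectionState> blip = Arrays.asList(
                ConnectionState.CONNECTED,
                ConnectionState.SUSPENDED,
                ConnectionState.RECONNECTED);

        for (RevokePolicy policy : RevokePolicy.values()) {
            System.out.println(policy + ": " + countRevocations(policy, blip));
        }
    }
}
```

In this model a brief blip (CONNECTED, SUSPENDED, RECONNECTED) counts one revocation under `ON_SUSPENDED_OR_LOST` and none under `ON_LOST_ONLY`, while a sequence that actually reaches LOST counts one revocation under either policy.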
