[ https://issues.apache.org/jira/browse/FLINK-13189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882289#comment-16882289 ]
lamber-ken commented on FLINK-13189:
------------------------------------

Thanks for the reminder, [~elevy]. We hit this problem in our production environment, and the latest Flink version does not handle it either, so we created this JIRA. We did not realize it was a duplicate when we created it.

> Fix the impact of zookeeper network disconnect temporarily on flink long running jobs
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-13189
>                 URL: https://issues.apache.org/jira/browse/FLINK-13189
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.8.1
>            Reporter: lamber-ken
>            Assignee: lamber-ken
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Issue detail info*
> We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use
> ZooKeeper as the HighAvailabilityService, but we found that a Flink job
> restarts whenever the network between the JobManager and ZooKeeper is
> temporarily disconnected.
> So we analyzed the problem in depth. The Flink JobManager uses Curator's
> `LeaderLatch` to maintain leadership. When the network disconnects, the
> `LeaderLatch` sets leadership to false immediately. We think this is too
> drastic: many long-running Flink jobs restart because of a brief network
> glitch.
>
> *Fix this issue*
> According to the official Curator website, this issue was fixed in
> Curator 3.x.x, but we cannot simply change the flink-curator-version (2.12.0)
> to 3.x.x because of zk-compatibility. Curator 2.x.x supports ZooKeeper 3.4.x
> and ZooKeeper 3.5.0, while Curator 3.x.x is compatible only with ZooKeeper 3.5.x.
> Based on the above considerations, we updated `LeaderLatch` in the
> flink-shaded-curator module.
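
The difference between the two behaviours described above can be illustrated with a small, self-contained sketch. This is not Curator's API and not the patch applied to flink-shaded-curator: `LatchSketch`, `Policy`, and `ConnectionEvent` are hypothetical names, and the model assumes only that a temporary disconnect is reported as a "suspended" event and an expired session as a "lost" event.

```java
/**
 * Toy model of a leader latch reacting to connection events.
 * All names here are hypothetical; this is not Curator's LeaderLatch.
 */
final class LatchSketch {
    /** Connection events as assumed by this sketch. */
    enum ConnectionEvent { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** When the latch gives up leadership. */
    enum Policy {
        DROP_ON_SUSPEND,   // give up leadership on any disconnect
        DROP_ON_LOST_ONLY  // keep leadership until the session is actually lost
    }

    private final Policy policy;
    private boolean leader;
    private int leadershipLosses;

    LatchSketch(Policy policy) {
        this.policy = policy;
    }

    /** Pretend the election was won. */
    void acquireLeadership() {
        leader = true;
    }

    /** React to a connection event according to the configured policy. */
    void onEvent(ConnectionEvent event) {
        switch (event) {
            case SUSPENDED:
                // A temporary disconnect only revokes leadership under the eager policy.
                if (policy == Policy.DROP_ON_SUSPEND) {
                    revoke();
                }
                break;
            case LOST:
                // A lost session always revokes leadership.
                revoke();
                break;
            case CONNECTED:
            case RECONNECTED:
                // A real latch would re-enter the election here; the sketch leaves state alone.
                break;
        }
    }

    private void revoke() {
        if (leader) {
            leader = false;
            leadershipLosses++;
        }
    }

    boolean hasLeadership() {
        return leader;
    }

    int leadershipLosses() {
        return leadershipLosses;
    }
}
```

Feeding a `SUSPENDED` event followed by `RECONNECTED` to a latch built with `DROP_ON_SUSPEND` records one leadership loss, which in Flink's case would correspond to a job restart. The same sequence with `DROP_ON_LOST_ONLY` leaves leadership untouched, and only a `LOST` event revokes it.
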
>
> *Other*
> Any suggestions are welcome, thanks.
>
> *Useful links*
> [https://curator.apache.org/zk-compatibility.html]
> [https://cwiki.apache.org/confluence/display/CURATOR/Releases]
> [http://curator.apache.org/curator-recipes/leader-latch.html]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)