Hi All,

Description
We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use 
ZooKeeper as the high-availability service. We found that a Flink job restarts 
whenever the network connection between the JobManager and ZooKeeper is 
temporarily interrupted, so we analyzed the problem in depth. The Flink 
JobManager uses Curator's `LeaderLatch` to maintain leadership. When the 
network disconnects, `LeaderLatch` immediately sets the leadership to false. We 
think this is too drastic, because many long-running Flink jobs restart on a 
brief network glitch. Instead of directly revoking the leadership upon a 
SUSPENDED ZooKeeper connection, it would be better to wait until the ZooKeeper 
connection is LOST.
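
To make the difference concrete, here is a minimal, self-contained sketch of 
the two behaviours. It is not Curator's or Flink's actual code: the class 
`LeaderLatchSketch`, its `revokeOnSuspended` flag and the local 
`ConnectionState` enum are hypothetical stand-ins that only mirror the state 
names used above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical model of the leadership handling discussed above.
 * This is NOT Curator's LeaderLatch; it only illustrates the policy.
 */
final class LeaderLatchSketch {

    /** Local stand-in for the ZooKeeper connection states named in this mail. */
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    // For illustration, the sketch starts out holding leadership.
    private final AtomicBoolean hasLeadership = new AtomicBoolean(true);

    // true  = behaviour described above (revoke as soon as the connection is SUSPENDED)
    // false = proposed behaviour (keep leadership until the connection is LOST)
    private final boolean revokeOnSuspended;

    LeaderLatchSketch(boolean revokeOnSuspended) {
        this.revokeOnSuspended = revokeOnSuspended;
    }

    void handleStateChange(ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                if (revokeOnSuspended) {
                    hasLeadership.set(false);
                }
                break;
            case LOST:
                // The session is gone, so leadership must be given up in both variants.
                hasLeadership.set(false);
                break;
            default:
                // CONNECTED / RECONNECTED: a real latch would re-run the election here;
                // this sketch simply leaves the flag untouched.
                break;
        }
    }

    boolean hasLeadership() {
        return hasLeadership.get();
    }

    public static void main(String[] args) {
        LeaderLatchSketch current = new LeaderLatchSketch(true);
        LeaderLatchSketch proposed = new LeaderLatchSketch(false);

        // A short network glitch: SUSPENDED followed by RECONNECTED.
        for (ConnectionState s : new ConnectionState[] {
                ConnectionState.SUSPENDED, ConnectionState.RECONNECTED}) {
            current.handleStateChange(s);
            proposed.handleStateChange(s);
        }
        System.out.println("current keeps leadership:  " + current.hasLeadership());
        System.out.println("proposed keeps leadership: " + proposed.hasLeadership());
    }
}
```

With the proposed variant, a SUSPENDED state followed by RECONNECTED leaves the 
leadership flag untouched, so a job would not need to restart. Only a LOST 
connection revokes leadership.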

There are two JIRA issues about this problem, FLINK-10052 and FLINK-13189; they 
are duplicates. Thanks to @Elias Levy for pointing this out, so we closed 
FLINK-13189.

Solution
Back to the problem: there are currently two ways to solve it. One is to 
rewrite the `LeaderLatch#handleStateChange` method; the other is to upgrade to 
Curator 4.2.0. The first is hacky but correct; the second requires checking 
compatibility. For more details, please see FLINK-10052.

Hope
FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope this 
problem can be fixed as soon as possible.
By the way, thanks to @TisonKun for discussing this problem and reviewing the PR.

Links
FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189

Any suggestions are welcome. What do you think?

Best, lamber-ken.
