Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections

Tue, 23 Jul 2019 06:28:17 -0700

OK, if you have any suggestions, we can discuss the details under
FLINK-10052.



Best.


------------------ Original Message ------------------
From: "Till Rohrmann"<trohrm...@apache.org>;
Sent: Tuesday, July 23, 2019, 9:19 PM
To: "dev"<dev@flink.apache.org>;

Subject: Re: [DISSCUSS] Tolerate temporarily suspended ZooKeeper connections



Hi Lamber-Ken,

thanks for starting this discussion. I think there is a benefit in not
immediately giving up leadership when the ZooKeeper connection goes into the
SUSPENDED state. In particular, if we can guarantee that there is only a
single JobMaster, it might make sense not to give up leadership too
eagerly. I would suggest continuing the technical discussion on the
JIRA issue thread, since it already contains a good amount of detail.

Cheers,
Till

On Sat, Jul 20, 2019 at 12:55 PM QQ <2217232...@qq.com> wrote:

> Hi All,
>
> Desc
> We deploy Flink streaming jobs on a Hadoop cluster in per-job mode and use
> ZooKeeper as the HighAvailabilityService, but we found that a Flink job
> restarts whenever the network between the JobManager and ZooKeeper is
> temporarily disconnected, so we analyzed the problem in depth. The Flink
> JobManager uses Curator's `LeaderLatch` to maintain leadership. When the
> network disconnects, the `LeaderLatch` sets leadership to false
> immediately. We think it is too drastic that many long-running Flink jobs
> restart because of a brief network glitch. Instead of directly revoking
> leadership upon a SUSPENDED ZooKeeper connection, it would be better to
> wait until the ZooKeeper connection is LOST.
>
> There are two JIRA issues about this problem, FLINK-10052 and FLINK-13189;
> they are duplicates. Thanks to @Elias Levy for pointing this out, so
> FLINK-13189 has been closed.
>
> Solution
> Back to the problem: there are currently two ways to solve it. One is to
> rewrite the LeaderLatch#handleStateChange method; the other is to upgrade
> to curator-4.2.0. The first way is hacky but correct; the second requires
> considering compatibility. For more details, please see FLINK-10052.
>
> Hope
> FLINK-10052 was reported on 2018-08-03 (about a year ago), so we hope
> this problem can be fixed as soon as possible.
> By the way, thanks to @TisonKun for discussing this problem and reviewing
> the PR.
>
> Links
> FLINK-10052 https://issues.apache.org/jira/browse/FLINK-10052
> FLINK-13189 https://issues.apache.org/jira/browse/FLINK-13189
>
> Any suggestions are welcome. What do you think?
>
> Best, lamber-ken.

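For readers following the thread, the behaviour proposed in the quoted message (keep leadership while the connection is SUSPENDED, revoke it only once the connection is LOST) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not Flink's or Curator's actual code: the `ConnState` enum only mirrors a subset of the state names Curator uses, and `TolerantLeadership` and its methods are hypothetical names invented for this example. Re-election after a LOST connection is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Assumption: a local stand-in that mirrors some of Curator's connection state names. */
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/** Hypothetical leadership tracker that tolerates SUSPENDED connections. */
class TolerantLeadership {
    private boolean leader;
    private final List<String> events = new ArrayList<>();

    TolerantLeadership(boolean initiallyLeader) {
        this.leader = initiallyLeader;
    }

    /** Reacts to a connection state change; only LOST revokes leadership. */
    void onStateChange(ConnState state) {
        switch (state) {
            case SUSPENDED:
                // Temporary disconnect: keep the current leadership flag.
                events.add("SUSPENDED: leadership unchanged");
                break;
            case LOST:
                // Connection is considered gone: give up leadership if we hold it.
                if (leader) {
                    leader = false;
                    events.add("LOST: leadership revoked");
                }
                break;
            case CONNECTED:
            case RECONNECTED:
                // Nothing to revoke; re-election after LOST is omitted in this sketch.
                events.add(state + ": no change");
                break;
        }
    }

    boolean isLeader() {
        return leader;
    }

    List<String> events() {
        return events;
    }
}

/** Small driver that replays a network glitch followed by a real connection loss. */
class SuspendedToleranceDemo {
    public static void main(String[] args) {
        TolerantLeadership leadership = new TolerantLeadership(true);
        leadership.onStateChange(ConnState.SUSPENDED);
        leadership.onStateChange(ConnState.RECONNECTED);
        leadership.onStateChange(ConnState.SUSPENDED);
        leadership.onStateChange(ConnState.LOST);
        for (String event : leadership.events()) {
            System.out.println(event);
        }
        System.out.println("still leader: " + leadership.isLeader());
    }
}
```

In this sketch a SUSPENDED followed by RECONNECTED leaves leadership intact, so a running job would not be restarted by a short glitch; only the LOST transition flips the flag. Whether this is safe in practice depends on the point raised in the reply above, namely being able to guarantee that only a single JobMaster is active.
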