This might be related to FLINK-10052[1].
Unfortunately, we have not made any progress on this ticket.
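
To make the request in the quoted message concrete, here is a minimal sketch of a configurable grace period for the SUSPENDED state. It is not Flink or Curator code: the class SuspendedGracePeriod, the enum ConnState and the method names are all hypothetical, and the timestamps are passed in explicitly so the logic can be followed without timers.

```java
import java.time.Duration;

/** Hypothetical sketch only; not part of Flink or Curator. */
final class SuspendedGracePeriod {

    /** Stand-in for the connection states discussed in this thread. */
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final long graceMillis;
    private long suspendedSinceMillis = -1L; // -1 means "not suspended"
    private boolean lost = false;

    SuspendedGracePeriod(Duration grace) {
        this.graceMillis = grace.toMillis();
    }

    /** Records a connection state change observed at the given time. */
    void onStateChange(ConnState state, long nowMillis) {
        switch (state) {
            case SUSPENDED:
                // Start the grace period on the first SUSPENDED event only.
                if (suspendedSinceMillis < 0) {
                    suspendedSinceMillis = nowMillis;
                }
                break;
            case CONNECTED:
            case RECONNECTED:
                // Connection is back: forget the suspension and any loss.
                suspendedSinceMillis = -1L;
                lost = false;
                break;
            case LOST:
                // A definite loss is acted on immediately.
                lost = true;
                break;
        }
    }

    /** Returns true if the leader should be treated as lost at the given time. */
    boolean shouldRevokeLeader(long nowMillis) {
        if (lost) {
            return true;
        }
        return suspendedSinceMillis >= 0
                && nowMillis - suspendedSinceMillis >= graceMillis;
    }
}
```

With a grace period of 30 seconds, a SUSPENDED event followed by RECONNECTED within that window would not revoke the leader, while LOST, or a suspension that outlasts the window, still would.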

cc @Till Rohrmann <trohrm...@apache.org>

Best,
Yang

chenqin <qinnc...@gmail.com> wrote on Tue, Apr 13, 2021 at 7:31 AM:

> Hi there,
>
> We observed several jobs running on Flink 1.11 restart because the job
> leader was lost.
> Digging deeper, the issue seems related to the SUSPENDED state handler in
> ZooKeeperLeaderRetrievalService.
>
> AFAIK, the SUSPENDED state is expected when ZooKeeper is not certain whether
> the leader is still alive. It can be followed by RECONNECTED or LOST. In the
> current implementation [1], we treat the SUSPENDED state the same as the LOST
> state and actively shut down the job. This poses a stability issue in large
> HA setups.
>
> My questions are: can we get some insight into the reasoning behind this
> decision, and could we add a tunable configuration option that lets users
> decide how long their jobs can tolerate such an uncertain SUSPENDED state?
>
> Thanks,
> Chen
>
> [1]
>
> https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
>
