Re: handle SUSPENDED in ZooKeeperLeaderRetrievalService

Chen Qin Thu, 15 Apr 2021 11:02:30 -0700

Hi there,

Thanks for the pointers to the related changes and JIRA. An update from
our side: we applied a patch that merges FLINK-10052
<https://issues.apache.org/jira/browse/FLINK-10052> with master and
handles only the LOST state, leveraging the SessionConnectionStateErrorPolicy
that FLINK-10052 <https://issues.apache.org/jira/browse/FLINK-10052> introduced.
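
To make the difference concrete, here is a minimal, self-contained sketch of
the decision involved. It is not the Flink or Curator code: the class, enum
and method names below are hypothetical stand-ins. It only assumes what
Curator's two error policies do, namely that the standard policy treats both
SUSPENDED and LOST as errors while the session policy treats only LOST as one.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative only: a stand-in for the connection-state decision, not Flink or Curator code. */
final class SuspendedHandlingSketch {

    /** Mirrors the names of the connection states discussed in this thread. */
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    /** Hypothetical policy type: each policy lists the states it treats as leadership loss. */
    enum ErrorPolicy {
        // Assume the worst: a suspended connection already counts as an error.
        STANDARD(EnumSet.of(ConnectionState.SUSPENDED, ConnectionState.LOST)),
        // Only react once the session is actually lost.
        SESSION(EnumSet.of(ConnectionState.LOST));

        private final Set<ConnectionState> errorStates;

        ErrorPolicy(Set<ConnectionState> errorStates) {
            this.errorStates = errorStates;
        }

        boolean isErrorState(ConnectionState state) {
            return errorStates.contains(state);
        }
    }

    /** Returns true if a listener using the given policy would report the leader as lost. */
    static boolean shouldNotifyLeaderLoss(ErrorPolicy policy, ConnectionState state) {
        return policy.isErrorState(state);
    }

    public static void main(String[] args) {
        // Print how each policy reacts to each connection state.
        for (ConnectionState state : ConnectionState.values()) {
            System.out.printf("%-11s standard=%b session=%b%n",
                    state,
                    shouldNotifyLeaderLoss(ErrorPolicy.STANDARD, state),
                    shouldNotifyLeaderLoss(ErrorPolicy.SESSION, state));
        }
    }
}
```

With the session-style policy, a SUSPENDED that is followed by RECONNECTED
never triggers a leader-loss notification, which is the behaviour we are
testing.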


Preliminary results are good: the same workload (240 TMs) in the same
environment runs stably, without the frequent restarts caused by the
SUSPENDED state (which seem to have been false positives). We are working on
more stringent load testing as well as chaos testing (blocking ZK). Will keep
folks posted.

Thanks,
Chen


On Tue, Apr 13, 2021 at 1:34 AM Till Rohrmann <[email protected]> wrote:

> Hi Chenqin,
>
> The current rationale behind assuming a leadership loss when seeing a
> SUSPENDED connection is to assume the worst and to be on the safe side.
>
> Yang Wang is correct. FLINK-10052 [1] has the goal to make the behaviour
> configurable. Unfortunately, the community did not have enough time to
> complete this feature.
>
> [1] https://issues.apache.org/jira/browse/FLINK-10052
>
> Cheers,
> Till
>
> On Tue, Apr 13, 2021 at 8:25 AM Yang Wang <[email protected]> wrote:
>
> > This might be related to FLINK-10052[1].
> > Unfortunately, there has not been any progress on this ticket.
> >
> > cc @Till Rohrmann <[email protected]>
> >
> > Best,
> > Yang
> >
> > On Tue, Apr 13, 2021 at 7:31 AM chenqin <[email protected]> wrote:
> >
> >> Hi there,
> >>
> >> We observed several jobs running on 1.11 restart due to job leader
> >> loss.
> >> Digging deeper, the issue seems related to the SUSPENDED state handler in
> >> ZooKeeperLeaderRetrievalService.
> >>
> >> AFAIK, the SUSPENDED state is expected when ZK is not certain whether the
> >> leader is still alive. It can be followed by RECONNECTED or LOST. In the
> >> current implementation [1], we treat the SUSPENDED state the same as the
> >> LOST state and actively shut down the job. This poses a stability issue in
> >> large HA setups.
> >>
> >> My question is: can we get some insight into the reasoning behind this
> >> decision, and could we add a tunable configuration that lets users decide
> >> how long they can tolerate such an uncertain SUSPENDED state in their
> >> jobs?
> >>
> >> Thanks,
> >> Chen
> >>
> >> [1]
> >> https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
> >>
> >> --
> >> Sent from:
> >> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
> >>
> >
>
