Could you show the log about which unhandled exception was thrown?

Best,
tison.
Chen Qin <qinnc...@gmail.com> wrote on Fri, Apr 23, 2021 at 1:06 PM:

> Hi Tison,
>
> Please read my latest comments in the thread. Using SessionErrorPolicy
> mitigated the suspended state issue, while it might trigger an unhandled zk
> client exception in some situations. We would like to get some idea of the
> root cause of that issue to avoid introducing another issue in the fix.
>
> Chen
>
> On Thu, Apr 22, 2021 at 10:04 AM tison <wander4...@gmail.com> wrote:
>
> > > My question is: can we get some insight into this decision, and could
> > > we add some tunable configuration for users to decide how long they
> > > can tolerate such an uncertain suspended state in their jobs?
> >
> > For the specific question, Curator provides a configuration option for
> > the session timeout, and a LOST event will be generated if the
> > disconnected period lasts longer than the configured timeout.
> >
> > https://github.com/apache/flink/blob/58a7c80fa35424608ad44d1d6691d1407be0092a/flink-runtime/src/main/java/org/apache/flink/runtime/util/ZooKeeperUtils.java#L101-L102
> >
> > Best,
> > tison.
> >
> > tison <wander4...@gmail.com> wrote on Fri, Apr 23, 2021 at 12:57 AM:
> >
> > > To be concrete, if ZK is suspended and then reconnects, NodeCache
> > > already does the reset work for you, and if there is a leader epoch
> > > update, the fencing token, a.k.a. the leader session id, would be
> > > updated, so you will notice it.
> > >
> > > If ZK is permanently lost, I think it is a system-wide fault and you'd
> > > better restart the job from a checkpoint/savepoint with a working ZK
> > > ensemble.
> > >
> > > I am possibly concluding without more detailed investigation, though.
> > >
> > > Best,
> > > tison.
> > >
> > > tison <wander4...@gmail.com> wrote on Fri, Apr 23, 2021 at 12:35 AM:
> > >
> > > > > Unfortunately, we do not have any progress on this ticket.
> > > >
> > > > Here is a PR[1].
> > > >
> > > > Here is the base PR[2] I made about one year ago without any
> > > > follow-up review.
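The session-timeout behaviour tison describes — Curator only producing LOST once a disconnect outlasts the configured timeout — can be sketched as a small state tracker. This is a hypothetical, self-contained illustration of the kind of tunable tolerance window the thread asks for; the class and method names here are invented for the sketch, not Flink or Curator API:

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch (not actual Flink/Curator code): escalate a
 * SUSPENDED connection to LOST only after a tunable tolerance window,
 * instead of treating SUSPENDED as LOST immediately.
 */
public final class SuspensionTracker {
    private final long toleranceMs;   // how long a job can endure SUSPENDED
    private long suspendedAtMs = -1;  // -1 means currently connected

    public SuspensionTracker(long toleranceMs) {
        this.toleranceMs = toleranceMs;
    }

    /** Would be called when the connection listener reports SUSPENDED. */
    public void onSuspended(long nowMs) {
        if (suspendedAtMs < 0) {
            suspendedAtMs = nowMs;
        }
    }

    /** Would be called when the connection listener reports RECONNECTED. */
    public void onReconnected() {
        suspendedAtMs = -1;
    }

    /** True once the disconnect has outlasted the tolerance: treat as LOST. */
    public boolean shouldTreatAsLost(long nowMs) {
        return suspendedAtMs >= 0 && nowMs - suspendedAtMs > toleranceMs;
    }

    public static void main(String[] args) {
        SuspensionTracker tracker =
            new SuspensionTracker(TimeUnit.SECONDS.toMillis(60));
        tracker.onSuspended(0);
        System.out.println(tracker.shouldTreatAsLost(30_000));  // still within tolerance
        tracker.onReconnected();
        System.out.println(tracker.shouldTreatAsLost(120_000)); // reconnected in time
        tracker.onSuspended(130_000);
        System.out.println(tracker.shouldTreatAsLost(200_000)); // outlasted tolerance
    }
}
```

In a real integration the onSuspended/onReconnected calls would be driven by Curator's connection state callbacks, and the tolerance would be the new user-facing configuration knob.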
> > > > qinnc...@gmail.com:
> > > >
> > > > It requires further investigation about the impact involved by
> > > > FLINK-18677[3]. I do have some comments[4], but so far I regard it
> > > > as a stability problem instead of a correctness problem.
> > > >
> > > > FLINK-18677 tries to "fix" an unreasonable scenario where zk is lost
> > > > FOREVER, and I don't want to spend any time on it before there are
> > > > reactions on FLINK-10052; otherwise it is highly likely to be in
> > > > vain again, from my perspective.
> > > >
> > > > Best,
> > > > tison.
> > > >
> > > > [1] https://github.com/apache/flink/pull/15675
> > > > [2] https://github.com/apache/flink/pull/11338
> > > > [3] https://issues.apache.org/jira/browse/FLINK-18677
> > > > [4] https://github.com/apache/flink/pull/13055#discussion_r615871963
> > > >
> > > > Chen Qin <qinnc...@gmail.com> wrote on Fri, Apr 23, 2021 at 12:15 AM:
> > > >
> > > > > Hi there,
> > > > >
> > > > > Quick dial back here: we have been running load testing and so far
> > > > > haven't seen the suspended state cause job restarts.
> > > > >
> > > > > Some findings: instead of the Curator framework capturing the
> > > > > suspended state and actively notifying of leader loss, we have
> > > > > seen task managers propagate unhandled errors from the zk client,
> > > > > most likely due to
> > > > > high-availability.zookeeper.client.max-retry-attempts being set to
> > > > > 3 with a 5-second interval. It would be great if we handled this
> > > > > exception gracefully with a meaningful exception message. Those
> > > > > error messages happen when other task managers die due to user
> > > > > code exceptions; we would like more insight on this as well.
> > > > >
> > > > > For more context, Lu from our team also filed [2] describing an
> > > > > issue with 1.9; so far we haven't seen regression on the ongoing
> > > > > load testing jobs.
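The retry settings mentioned above map to Flink's ZooKeeper client options. A flink-conf.yaml fragment with what I believe are the relevant keys (the values shown are the commonly documented defaults; double-check them against your Flink version):

```yaml
# ZooKeeper client tuning for HA (values are believed defaults; verify per version)
high-availability.zookeeper.client.session-timeout: 60000    # ms
high-availability.zookeeper.client.connection-timeout: 15000 # ms
high-availability.zookeeper.client.retry-wait: 5000          # ms between retries
high-availability.zookeeper.client.max-retry-attempts: 3     # retries before giving up
```

With 3 attempts at 5-second intervals, background operations would give up after roughly 15 seconds of broken connectivity, which would be consistent with the unhandled ConnectionLossException reported in this thread.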
> > > > > Thanks,
> > > > > Chen
> > > > >
> > > > > Caused by:
> > > > > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > KeeperErrorCode = ConnectionLoss
> > > > >     at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> > > > >     at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-10052
> > > > > [2] https://issues.apache.org/jira/browse/FLINK-19985
> > > > >
> > > > > On Thu, Apr 15, 2021 at 7:27 PM Yang Wang <danrtsey...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for trying the unfinished PR and sharing the testing
> > > > > > results. Glad to hear that it could work, and we really look
> > > > > > forward to the results of the more stringent load testing.
> > > > > >
> > > > > > After that, I think we could revive this ticket.
> > > > > >
> > > > > > Best,
> > > > > > Yang
> > > > > >
> > > > > > Chen Qin <qinnc...@gmail.com> wrote on Fri, Apr 16, 2021 at 2:01 AM:
> > > > > >
> > > > > > > Hi there,
> > > > > > >
> > > > > > > Thanks for providing pointers to the related changes and JIRA.
> > > > > > > Some updates from our side: we applied a patch by merging
> > > > > > > FLINK-10052 <https://issues.apache.org/jira/browse/FLINK-10052>
> > > > > > > with master, as well as only handling the lost state by
> > > > > > > leveraging the SessionConnectionStateErrorPolicy that
> > > > > > > FLINK-10052 <https://issues.apache.org/jira/browse/FLINK-10052>
> > > > > > > introduced.
> > > > > > >
> > > > > > > Preliminary results were good: the same workload (240 TMs) in
> > > > > > > the same environment runs stably without the frequent restarts
> > > > > > > due to the suspended state (which seems to be a false
> > > > > > > positive). We are working on more stringent load testing as
> > > > > > > well as chaos testing (blocking zk).
> > > > > > > Will keep folks posted.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Chen
> > > > > > >
> > > > > > > On Tue, Apr 13, 2021 at 1:34 AM Till Rohrmann <trohrm...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi Chenqin,
> > > > > > > >
> > > > > > > > The current rationale behind assuming a leadership loss when
> > > > > > > > seeing a SUSPENDED connection is to assume the worst and to
> > > > > > > > be on the safe side.
> > > > > > > >
> > > > > > > > Yang Wang is correct. FLINK-10052 [1] has the goal of making
> > > > > > > > the behaviour configurable. Unfortunately, the community did
> > > > > > > > not have enough time to complete this feature.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-10052
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Till
> > > > > > > >
> > > > > > > > On Tue, Apr 13, 2021 at 8:25 AM Yang Wang <danrtsey...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > This might be related to FLINK-10052[1].
> > > > > > > > > Unfortunately, we do not have any progress on this ticket.
> > > > > > > > >
> > > > > > > > > cc @Till Rohrmann <trohrm...@apache.org>
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Yang
> > > > > > > > >
> > > > > > > > > chenqin <qinnc...@gmail.com> wrote on Tue, Apr 13, 2021 at 7:31 AM:
> > > > > > > > >
> > > > > > > > > > Hi there,
> > > > > > > > > >
> > > > > > > > > > We observed several jobs running on 1.11 restart due to
> > > > > > > > > > job leader loss. Digging deeper, the issue seems related
> > > > > > > > > > to the SUSPENDED state handler in
> > > > > > > > > > ZooKeeperLeaderRetrievalService.
> > > > > > > > > >
> > > > > > > > > > AFAIK, the suspended state is expected when zk is not
> > > > > > > > > > certain whether the leader is still alive. It can be
> > > > > > > > > > followed by RECONNECTED or LOST. In the current
> > > > > > > > > > implementation [1], we treat the suspended state the
> > > > > > > > > > same as the lost state and actively shut down the job.
> > > > > > > > > > This poses a stability issue on large HA setups.
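The behaviour being debated — whether SUSPENDED should count as an error at all — comes down to which connection states a policy flags as fatal. A self-contained sketch of that contrast; the enum and interface here are simplified stand-ins rather than the real Curator types, though StandardConnectionStateErrorPolicy and SessionConnectionStateErrorPolicy are the actual Curator class names referenced in the thread:

```java
/**
 * Hypothetical, self-contained model (not actual Curator classes) of the
 * difference between Curator's standard error policy and the session-based
 * one that the FLINK-10052 patch leverages: the former flags both
 * SUSPENDED and LOST as errors, the latter only LOST.
 */
public final class ErrorPolicyDemo {
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    interface ErrorPolicy {
        boolean isError(ConnState state);
    }

    /** Standard-style policy: SUSPENDED is already treated as fatal. */
    static final ErrorPolicy STANDARD =
        s -> s == ConnState.SUSPENDED || s == ConnState.LOST;

    /** Session-style policy: only LOST is fatal. */
    static final ErrorPolicy SESSION =
        s -> s == ConnState.LOST;

    public static void main(String[] args) {
        // Under the standard-style policy a transient SUSPENDED already
        // revokes leadership; under the session-style policy the job
        // rides it out until the session is truly LOST.
        System.out.println(STANDARD.isError(ConnState.SUSPENDED)); // true
        System.out.println(SESSION.isError(ConnState.SUSPENDED));  // false
        System.out.println(SESSION.isError(ConnState.LOST));       // true
    }
}
```

This makes the trade-off in the thread concrete: the session-style policy trades the worst-case safety Till describes for the stability Chen observed under load.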
> > > > > > > > > > My question is: can we get some insight into this
> > > > > > > > > > decision, and could we add some tunable configuration
> > > > > > > > > > for users to decide how long they can tolerate such an
> > > > > > > > > > uncertain suspended state in their jobs?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Chen
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > > https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Sent from:
> > > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/