Sure, I updated the JIRA with the exception info. We can follow up there for the technical discussion.
https://issues.apache.org/jira/browse/FLINK-10052?focusedCommentId=17330858&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17330858

On Thu, Apr 22, 2021 at 10:17 PM tison <wander4...@gmail.com> wrote:

The original log (section) is preferred over rephrasing.

Best,
tison.

On Fri, Apr 23, 2021 at 1:15 PM tison <wander4...@gmail.com> wrote:

Could you show the log with the unhandled exception that was thrown?

Best,
tison.

On Fri, Apr 23, 2021 at 1:06 PM Chen Qin <qinnc...@gmail.com> wrote:

Hi Tison,

Please read my latest comments in the thread. Using SessionErrorPolicy mitigated the suspended-state issue, but it might trigger an unhandled ZK client exception in some situations. We would like to get some idea of the root cause of that issue, to avoid introducing another issue with the fix.

Chen

On Thu, Apr 22, 2021 at 10:04 AM tison <wander4...@gmail.com> wrote:

> My question is can we get some insight behind this decision and could we
> add some tunable configuration for user to decide how long they can endure
> such uncertain suspended state in their jobs.

For this specific question, Curator provides a configuration option for the session timeout, and a LOST state is generated if the disconnection lasts longer than the configured timeout.

https://github.com/apache/flink/blob/58a7c80fa35424608ad44d1d6691d1407be0092a/flink-runtime/src/main/java/org/apache/flink/runtime/util/ZooKeeperUtils.java#L101-L102

Best,
tison.

On Fri, Apr 23, 2021 at 12:57 AM tison <wander4...@gmail.com> wrote:

To be concrete, if ZK is suspended and then reconnects, NodeCache already does the reset work for you, and if the leader epoch was updated, the fencing token (a.k.a. the leader session id) will be updated, so you will notice it.

If ZK is permanently lost, I think it is a system-wide fault and you had better restart the job from a checkpoint/savepoint with a working ZK ensemble.

I am possibly concluding without more detailed investigation, though.

Best,
tison.

On Fri, Apr 23, 2021 at 12:35 AM tison <wander4...@gmail.com> wrote:

> Unfortunately, we do not have any progress on this ticket.

Here is a PR[1].

Here is the base PR[2] I made about one year ago, without any follow-up review.

qinnc...@gmail.com:

It requires further investigation into the impact introduced by FLINK-18677[3]. I do have some comments[4], but so far I regard it as a stability problem rather than a correctness problem.

FLINK-18677 tries to "fix" an unreasonable scenario where ZK is lost FOREVER, and I don't want to spend any time before there is a reaction on FLINK-10052; otherwise it is very likely to be in vain again, from my perspective.

Best,
tison.

[1] https://github.com/apache/flink/pull/15675
[2] https://github.com/apache/flink/pull/11338
[3] https://issues.apache.org/jira/browse/FLINK-18677
[4] https://github.com/apache/flink/pull/13055#discussion_r615871963
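A minimal, self-contained sketch of the error-policy behaviour discussed above, using plain Curator 4 APIs rather than the classes Flink shades; the connect string, timeouts, and the listener reactions are placeholder assumptions, and this is not the FLINK-10052 patch itself:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.state.SessionConnectionStateErrorPolicy;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public final class SessionPolicyExample {

        public static void main(String[] args) {
            CuratorFramework client = CuratorFrameworkFactory.builder()
                    .connectString("zk-1:2181,zk-2:2181,zk-3:2181")  // placeholder ensemble
                    .sessionTimeoutMs(60_000)                        // session timeout after which a disconnect becomes LOST
                    .retryPolicy(new ExponentialBackoffRetry(1_000, 3))
                    // Recipes that consult this policy (e.g. LeaderLatch) then treat only
                    // LOST, not a transient SUSPENDED, as an error state.
                    .connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())
                    .build();

            client.getConnectionStateListenable().addListener((c, newState) -> {
                switch (newState) {
                    case SUSPENDED:
                        // Connection in doubt: hold off on leader-dependent work, keep leadership.
                        break;
                    case LOST:
                        // Session expired: leadership must be considered gone; start recovery.
                        break;
                    case RECONNECTED:
                        // Back within the same session: re-read the leader node / epoch and resume.
                        break;
                    default:
                        break;
                }
            });

            client.start();
        }
    }

Curator's default StandardConnectionStateErrorPolicy treats SUSPENDED as an error state as well, which is why a SUSPENDED connection ends up being handled like a lost leader in the behaviour described at the bottom of this thread.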
On Fri, Apr 23, 2021 at 12:15 AM Chen Qin <qinnc...@gmail.com> wrote:

Hi there,

A quick dial-back here: we have been running load testing, and so far we have not seen the suspended state cause job restarts.

Some findings: instead of the Curator framework capturing the suspended state and actively notifying that the leader was lost, we have seen task managers propagate unhandled errors from the ZK client, most likely because high-availability.zookeeper.client.max-retry-attempts was set to 3 with a 5-second interval. It would be great if we handled this exception gracefully with a meaningful exception message. Those errors appear when other task managers die due to user-code exceptions; we would like more insight into this as well.

For more context, Lu from our team also filed [2] describing an issue with 1.9; so far we haven't seen a regression in the ongoing load-testing jobs.

Thanks,
Chen

Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)

[1] https://issues.apache.org/jira/browse/FLINK-10052
[2] https://issues.apache.org/jira/browse/FLINK-19985
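For reference, a small sketch of where the client settings mentioned above live in a Flink Configuration; the key names are the standard ZooKeeper HA options, while the values only echo the 3-attempt / 5-second setup described in this message and are not a recommendation:

    import org.apache.flink.configuration.Configuration;

    public final class ZkClientRetryConfig {

        public static Configuration build() {
            Configuration conf = new Configuration();
            // Number of background retries before the ZK client gives up and
            // surfaces errors such as the ConnectionLossException shown above.
            conf.setString("high-availability.zookeeper.client.max-retry-attempts", "3");
            // Pause between retries, in milliseconds (5 seconds here).
            conf.setString("high-availability.zookeeper.client.retry-wait", "5000");
            // ZooKeeper session timeout, in milliseconds; a disconnect is escalated
            // to session loss roughly after this has elapsed without reconnecting.
            conf.setString("high-availability.zookeeper.client.session-timeout", "60000");
            return conf;
        }
    }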
On Thu, Apr 15, 2021 at 7:27 PM Yang Wang <danrtsey...@gmail.com> wrote:

Thanks for trying the unfinished PR and sharing the testing results. Glad to hear that it could work, and I'm really looking forward to the results of the more stringent load testing.

After that, I think we could revive this ticket.

Best,
Yang

On Fri, Apr 16, 2021 at 2:01 AM Chen Qin <qinnc...@gmail.com> wrote:

Hi there,

Thanks for providing pointers to the related changes and JIRA. Some updates from our side: we applied a patch by merging FLINK-10052 <https://issues.apache.org/jira/browse/FLINK-10052> with master, handling only the lost state by leveraging the SessionConnectionStateErrorPolicy that FLINK-10052 <https://issues.apache.org/jira/browse/FLINK-10052> introduced.

Preliminary results were good: the same workload (240 TMs) in the same environment runs stably, without the frequent restarts caused by the suspended state (which appears to have been a false positive). We are working on more stringent load testing as well as chaos testing (blocking ZK). Will keep folks posted.

Thanks,
Chen

On Tue, Apr 13, 2021 at 1:34 AM Till Rohrmann <trohrm...@apache.org> wrote:

Hi Chenqin,

The current rationale behind assuming a leadership loss when seeing a SUSPENDED connection is to assume the worst and to be on the safe side.

Yang Wang is correct. FLINK-10052 [1] has the goal of making this behaviour configurable. Unfortunately, the community did not have enough time to complete this feature.

[1] https://issues.apache.org/jira/browse/FLINK-10052

Cheers,
Till

On Tue, Apr 13, 2021 at 8:25 AM Yang Wang <danrtsey...@gmail.com> wrote:

This might be related to FLINK-10052[1]. Unfortunately, we do not have any progress on this ticket.

cc @Till Rohrmann <trohrm...@apache.org>

Best,
Yang

On Tue, Apr 13, 2021 at 7:31 AM chenqin <qinnc...@gmail.com> wrote:

Hi there,

We observed several jobs running on 1.11 restart because the job leader was lost. Digging deeper, the issue seems related to the SUSPENDED state handler in ZooKeeperLeaderRetrievalService.

AFAIK, the suspended state is expected when ZK is not certain whether the leader is still alive; it can be followed by RECONNECTED or LOST. In the current implementation [1], we treat the suspended state the same as the lost state and actively shut down the job. This poses a stability issue in large HA settings.

My question is: can we get some insight into the reasoning behind this decision, and could we add some tunable configuration for users to decide how long they can tolerate such an uncertain suspended state in their jobs?

Thanks,
Chen

[1]
https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201

--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
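To make the behaviour described in the original message concrete, a condensed paraphrase of a connection-state handler that treats SUSPENDED the same as LOST; this is not the actual Flink source, and the notifyNoLeader helper is a hypothetical placeholder:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.state.ConnectionState;
    import org.apache.curator.framework.state.ConnectionStateListener;

    // Condensed illustration only; the real handler lives in ZooKeeperLeaderRetrievalService.
    final class SuspendedAsLostListener implements ConnectionStateListener {

        @Override
        public void stateChanged(CuratorFramework client, ConnectionState newState) {
            switch (newState) {
                case SUSPENDED:
                    // The connection is merely in doubt here, yet the handler falls through
                    // and clears the leader just as if the session were already lost.
                case LOST:
                    notifyNoLeader();
                    break;
                case RECONNECTED:
                    // Leader information is re-read from ZooKeeper after reconnecting.
                    break;
                default:
                    break;
            }
        }

        private void notifyNoLeader() {
            // Hypothetical placeholder: clear the cached leader address / session id and
            // notify the dependent components, which in turn cancel the running job.
        }
    }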