Sure, I updated the JIRA with the exception info. We can follow up there for the technical discussion.
https://issues.apache.org/jira/browse/FLINK-10052?focusedCommentId=17330858&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17330858

On Thu, Apr 22, 2021 at 10:17 PM tison <wander4...@gmail.com> wrote:

The original log (section) is preferred over rephrasing.

Best,
tison.

On Fri, Apr 23, 2021 at 1:15 PM tison <wander4...@gmail.com> wrote:

Could you show the log with the unhandled exception that was thrown?

Best,
tison.

On Fri, Apr 23, 2021 at 1:06 PM Chen Qin <qinnc...@gmail.com> wrote:

Hi Tison,

Please read my latest comments in the thread. Using SessionErrorPolicy mitigated the suspended-state issue, but it might trigger an unhandled ZK client exception in some situations. We would like to get some idea of the root cause of that issue, to avoid introducing another issue with the fix.

Chen

On Thu, Apr 22, 2021 at 10:04 AM tison <wander4...@gmail.com> wrote:

> My question is can we get some insight behind this decision and could we
> add some tunable configuration for user to decide how long they can endure
> such uncertain suspended state in their jobs.

For this specific question, Curator provides a configuration option for the session timeout, and a LOST state is generated if the disconnection lasts longer than the configured timeout.

https://github.com/apache/flink/blob/58a7c80fa35424608ad44d1d6691d1407be0092a/flink-runtime/src/main/java/org/apache/flink/runtime/util/ZooKeeperUtils.java#L101-L102

Best,
tison.

On Fri, Apr 23, 2021 at 12:57 AM tison <wander4...@gmail.com> wrote:

To be concrete, if ZK is suspended and then reconnects, NodeCache already does the reset work for you, and if the leader epoch was updated, the fencing token (a.k.a. the leader session id) will be updated, so you will notice it.

If ZK is permanently lost, I think it is a system-wide fault and you had better restart the job from a checkpoint/savepoint with a working ZK ensemble.

I am possibly concluding without more detailed investigation, though.

Best,
tison.

On Fri, Apr 23, 2021 at 12:35 AM tison <wander4...@gmail.com> wrote:

> Unfortunately, we do not have any progress on this ticket.

Here is a PR[1].

Here is the base PR[2] I made about one year ago, without any follow-up review.

qinnc...@gmail.com:

It requires further investigation into the impact introduced by FLINK-18677[3]. I do have some comments[4], but so far I regard it as a stability problem rather than a correctness problem.

FLINK-18677 tries to "fix" an unreasonable scenario where ZK is lost FOREVER, and I don't want to spend any time before there is a reaction on FLINK-10052; otherwise it is very likely to be in vain again, from my perspective.

Best,
tison.

[1] https://github.com/apache/flink/pull/15675
[2] https://github.com/apache/flink/pull/11338
[3] https://issues.apache.org/jira/browse/FLINK-18677
[4] https://github.com/apache/flink/pull/13055#discussion_r615871963
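A minimal, self-contained sketch of the error-policy behaviour discussed above, using plain Curator 4 APIs rather than the classes Flink shades; the connect string, timeouts, and the listener reactions are placeholder assumptions, and this is not the FLINK-10052 patch itself:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.state.SessionConnectionStateErrorPolicy;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public final class SessionPolicyExample {

        public static void main(String[] args) {
            CuratorFramework client = CuratorFrameworkFactory.builder()
                    .connectString("zk-1:2181,zk-2:2181,zk-3:2181")  // placeholder ensemble
                    .sessionTimeoutMs(60_000)                        // session timeout after which a disconnect becomes LOST
                    .retryPolicy(new ExponentialBackoffRetry(1_000, 3))
                    // Recipes that consult this policy (e.g. LeaderLatch) then treat only
                    // LOST, not a transient SUSPENDED, as an error state.
                    .connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())
                    .build();

            client.getConnectionStateListenable().addListener((c, newState) -> {
                switch (newState) {
                    case SUSPENDED:
                        // Connection in doubt: hold off on leader-dependent work, keep leadership.
                        break;
                    case LOST:
                        // Session expired: leadership must be considered gone; start recovery.
                        break;
                    case RECONNECTED:
                        // Back within the same session: re-read the leader node / epoch and resume.
                        break;
                    default:
                        break;
                }
            });

            client.start();
        }
    }

Curator's default StandardConnectionStateErrorPolicy treats SUSPENDED as an error state as well, which is why a SUSPENDED connection ends up being handled like a lost leader in the behaviour described at the bottom of this thread.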
On Fri, Apr 23, 2021 at 12:15 AM Chen Qin <qinnc...@gmail.com> wrote:

Hi there,

A quick dial-back here: we have been running load testing, and so far we have not seen the suspended state cause job restarts.

Some findings: instead of the Curator framework capturing the suspended state and actively notifying that the leader was lost, we have seen task managers propagate unhandled errors from the ZK client, most likely because high-availability.zookeeper.client.max-retry-attempts was set to 3 with a 5-second interval. It would be great if we handled this exception gracefully with a meaningful exception message. Those errors appear when other task managers die due to user-code exceptions; we would like more insight into this as well.

For more context, Lu from our team also filed [2] describing an issue with 1.9; so far we haven't seen a regression in the ongoing load-testing jobs.

Thanks,
Chen

Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
    at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)

[1] https://issues.apache.org/jira/browse/FLINK-10052
[2] https://issues.apache.org/jira/browse/FLINK-19985
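For reference, a small sketch of where the client settings mentioned above live in a Flink Configuration; the key names are the standard ZooKeeper HA options, while the values only echo the 3-attempt / 5-second setup described in this message and are not a recommendation:

    import org.apache.flink.configuration.Configuration;

    public final class ZkClientRetryConfig {

        public static Configuration build() {
            Configuration conf = new Configuration();
            // Number of background retries before the ZK client gives up and
            // surfaces errors such as the ConnectionLossException shown above.
            conf.setString("high-availability.zookeeper.client.max-retry-attempts", "3");
            // Pause between retries, in milliseconds (5 seconds here).
            conf.setString("high-availability.zookeeper.client.retry-wait", "5000");
            // ZooKeeper session timeout, in milliseconds; a disconnect is escalated
            // to session loss roughly after this has elapsed without reconnecting.
            conf.setString("high-availability.zookeeper.client.session-timeout", "60000");
            return conf;
        }
    }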
On Thu, Apr 15, 2021 at 7:27 PM Yang Wang <danrtsey...@gmail.com> wrote:

Thanks for trying the unfinished PR and sharing the testing results. Glad to hear that it could work, and I'm really looking forward to the results of the more stringent load testing.

After that, I think we could revive this ticket.

Best,
Yang

On Fri, Apr 16, 2021 at 2:01 AM Chen Qin <qinnc...@gmail.com> wrote:

Hi there,

Thanks for providing pointers to the related changes and JIRA. Some updates from our side: we applied a patch by merging FLINK-10052 <https://issues.apache.org/jira/browse/FLINK-10052> with master, handling only the lost state by leveraging the SessionConnectionStateErrorPolicy that FLINK-10052 <https://issues.apache.org/jira/browse/FLINK-10052> introduced.

Preliminary results were good: the same workload (240 TMs) in the same environment runs stably, without the frequent restarts caused by the suspended state (which appears to have been a false positive). We are working on more stringent load testing as well as chaos testing (blocking ZK). Will keep folks posted.

Thanks,
Chen

On Tue, Apr 13, 2021 at 1:34 AM Till Rohrmann <trohrm...@apache.org> wrote:

Hi Chenqin,

The current rationale behind assuming a leadership loss when seeing a SUSPENDED connection is to assume the worst and to be on the safe side.

Yang Wang is correct. FLINK-10052 [1] has the goal of making this behaviour configurable. Unfortunately, the community did not have enough time to complete this feature.

[1] https://issues.apache.org/jira/browse/FLINK-10052

Cheers,
Till

On Tue, Apr 13, 2021 at 8:25 AM Yang Wang <danrtsey...@gmail.com> wrote:

This might be related to FLINK-10052[1]. Unfortunately, we do not have any progress on this ticket.

cc @Till Rohrmann <trohrm...@apache.org>

Best,
Yang

On Tue, Apr 13, 2021 at 7:31 AM chenqin <qinnc...@gmail.com> wrote:

Hi there,

We observed several jobs running on 1.11 restart because the job leader was lost. Digging deeper, the issue seems related to the SUSPENDED state handler in ZooKeeperLeaderRetrievalService.

AFAIK, the suspended state is expected when ZK is not certain whether the leader is still alive; it can be followed by RECONNECTED or LOST. In the current implementation [1], we treat the suspended state the same as the lost state and actively shut down the job. This poses a stability issue in large HA settings.

My question is: can we get some insight into the reasoning behind this decision, and could we add some tunable configuration for users to decide how long they can tolerate such an uncertain suspended state in their jobs?

Thanks,
Chen

[1]
https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201

--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
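To make the behaviour described in the original message concrete, a condensed paraphrase of a connection-state handler that treats SUSPENDED the same as LOST; this is not the actual Flink source, and the notifyNoLeader helper is a hypothetical placeholder:

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.state.ConnectionState;
    import org.apache.curator.framework.state.ConnectionStateListener;

    // Condensed illustration only; the real handler lives in ZooKeeperLeaderRetrievalService.
    final class SuspendedAsLostListener implements ConnectionStateListener {

        @Override
        public void stateChanged(CuratorFramework client, ConnectionState newState) {
            switch (newState) {
                case SUSPENDED:
                    // The connection is merely in doubt here, yet the handler falls through
                    // and clears the leader just as if the session were already lost.
                case LOST:
                    notifyNoLeader();
                    break;
                case RECONNECTED:
                    // Leader information is re-read from ZooKeeper after reconnecting.
                    break;
                default:
                    break;
            }
        }

        private void notifyNoLeader() {
            // Hypothetical placeholder: clear the cached leader address / session id and
            // notify the dependent components, which in turn cancel the running job.
        }
    }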