The original log (the relevant section) is preferred over a rephrased description.

Best,
tison.
tison <wander4...@gmail.com> wrote on Fri, Apr 23, 2021 at 1:15 PM:

> Could you share the log showing which unhandled exception was thrown?
>
> Best,
> tison.
>
> Chen Qin <qinnc...@gmail.com> wrote on Fri, Apr 23, 2021 at 1:06 PM:
>
>> Hi Tison,
>>
>> Please read my latest comments in the thread. Using SessionErrorPolicy
>> mitigated the suspended-state issue, but it may trigger an unhandled ZK
>> client exception in some situations. We would like to understand the
>> root cause of that issue, to avoid introducing another issue with the
>> fix.
>>
>> Chen
>>
>> On Thu, Apr 22, 2021 at 10:04 AM tison <wander4...@gmail.com> wrote:
>>
>>>> My question is: can we get some insight into the reasoning behind
>>>> this decision, and could we add some tunable configuration that lets
>>>> users decide how long they can tolerate such an uncertain suspended
>>>> state in their jobs?
>>>
>>> For this specific question: Curator provides a configuration option
>>> for the session timeout, and a LOST event is generated if the
>>> disconnected state lasts longer than the configured timeout.
>>>
>>> https://github.com/apache/flink/blob/58a7c80fa35424608ad44d1d6691d1407be0092a/flink-runtime/src/main/java/org/apache/flink/runtime/util/ZooKeeperUtils.java#L101-L102
>>>
>>> Best,
>>> tison.
>>>
>>> tison <wander4...@gmail.com> wrote on Fri, Apr 23, 2021 at 12:57 AM:
>>>
>>>> To be concrete: if ZK is suspended and then reconnects, NodeCache
>>>> already does the reset work for you, and if the leader epoch has been
>>>> updated, the fencing token, a.k.a. the leader session id, is updated
>>>> as well, so you will notice it.
>>>>
>>>> If ZK is permanently lost, I think that is a system-wide fault and
>>>> you had better restart the job from a checkpoint/savepoint with a
>>>> working ZK ensemble.
>>>>
>>>> I may be drawing this conclusion without a sufficiently detailed
>>>> investigation, though.
>>>>
>>>> Best,
>>>> tison.
>>>>
>>>> tison <wander4...@gmail.com> wrote on Fri, Apr 23, 2021 at 12:35 AM:
>>>>
>>>>>> Unfortunately, we do not have any progress on this ticket.
>>>>>
>>>>> Here is a PR[1].
>>>>>
>>>>> Here is the base PR[2] I made about one year ago, which never
>>>>> received a review.
>>>>>
>>>>> qinnc...@gmail.com:
>>>>>
>>>>> It requires further investigation into the impact introduced by
>>>>> FLINK-18677[3]. I do have some comments[4], but so far I regard it
>>>>> as a stability problem rather than a correctness problem.
>>>>>
>>>>> FLINK-18677 tries to "fix" an unreasonable scenario where ZK is lost
>>>>> FOREVER, and I don't want to spend any more time before there is a
>>>>> reaction on FLINK-10052; otherwise, from my perspective, the effort
>>>>> will quite possibly be in vain again.
>>>>>
>>>>> Best,
>>>>> tison.
>>>>>
>>>>> [1] https://github.com/apache/flink/pull/15675
>>>>> [2] https://github.com/apache/flink/pull/11338
>>>>> [3] https://issues.apache.org/jira/browse/FLINK-18677
>>>>> [4] https://github.com/apache/flink/pull/13055#discussion_r615871963
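For context on the session-timeout point above: the ZooKeeperUtils lines tison links are where Flink feeds its HA settings into the Curator client builder. The following is a minimal sketch of that wiring, assuming the documented high-availability.zookeeper.client.* defaults of the 1.11 era; it is illustrative, not the actual Flink code.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public final class ZkClientSketch {

        public static CuratorFramework create(String connectString) {
            // Assumed defaults of high-availability.zookeeper.client.*;
            // Flink reads these from its Configuration in ZooKeeperUtils.
            int sessionTimeoutMs = 60000;    // ...client.session-timeout
            int connectionTimeoutMs = 15000; // ...client.connection-timeout
            int retryWaitMs = 5000;          // ...client.retry-wait
            int maxRetryAttempts = 3;        // ...client.max-retry-attempts

            CuratorFramework client = CuratorFrameworkFactory.builder()
                    .connectString(connectString)
                    .sessionTimeoutMs(sessionTimeoutMs)
                    .connectionTimeoutMs(connectionTimeoutMs)
                    .retryPolicy(new ExponentialBackoffRetry(
                            retryWaitMs, maxRetryAttempts))
                    .build();
            client.start();
            return client;
        }
    }

Once the client stays disconnected longer than the session timeout, Curator surfaces LOST to its connection-state listeners; that is the event tison refers to above.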
>>>>> Chen Qin <qinnc...@gmail.com> wrote on Fri, Apr 23, 2021 at 12:15 AM:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> A quick dial-back here: we have been running load tests, and so far
>>>>>> we have not seen the suspended state cause job restarts.
>>>>>>
>>>>>> Some findings: instead of the Curator framework capturing the
>>>>>> suspended state and actively notifying of leader loss, we have seen
>>>>>> task managers propagate unhandled errors from the ZK client, most
>>>>>> likely because high-availability.zookeeper.client.max-retry-attempts
>>>>>> was set to 3 with a 5-second interval. It would be great if we
>>>>>> handled this exception gracefully with a meaningful exception
>>>>>> message. Those errors occur when other task managers die due to
>>>>>> user-code exceptions; we would like more insight into this as well.
>>>>>>
>>>>>> For more context, Lu from our team also filed [2] reporting the
>>>>>> issue with 1.9; so far we have not seen a regression in the ongoing
>>>>>> load-testing jobs.
>>>>>>
>>>>>> Thanks,
>>>>>> Chen
>>>>>>
>>>>>> Caused by:
>>>>>> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException:
>>>>>> KeeperErrorCode = ConnectionLoss
>>>>>>   at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
>>>>>>   at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10052
>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-19985
>>>>>>
>>>>>> On Thu, Apr 15, 2021 at 7:27 PM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for trying the unfinished PR and sharing the testing
>>>>>>> results. Glad to hear that it works, and I am really looking
>>>>>>> forward to the results of the more stringent load testing.
>>>>>>>
>>>>>>> After that, I think we can revive this ticket.
>>>>>>>
>>>>>>> Best,
>>>>>>> Yang
>>>>>>>
>>>>>>> Chen Qin <qinnc...@gmail.com> wrote on Fri, Apr 16, 2021 at 2:01 AM:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> Thanks for the pointers to the related changes and JIRA tickets.
>>>>>>>> Some updates from our side: we applied a patch by merging
>>>>>>>> FLINK-10052 <https://issues.apache.org/jira/browse/FLINK-10052>
>>>>>>>> with master, and we handle only the LOST state, leveraging the
>>>>>>>> SessionConnectionStateErrorPolicy that FLINK-10052
>>>>>>>> <https://issues.apache.org/jira/browse/FLINK-10052> introduced.
>>>>>>>>
>>>>>>>> Preliminary results are good: the same workload (240 TMs) in the
>>>>>>>> same environment runs stably, without the frequent restarts caused
>>>>>>>> by the suspended state (which appear to be false positives). We
>>>>>>>> are working on more stringent load testing as well as chaos
>>>>>>>> testing (blocking ZK). Will keep folks posted.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Chen
>>>>>>>>
>>>>>>>> On Tue, Apr 13, 2021 at 1:34 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Hi Chenqin,
>>>>>>>>>
>>>>>>>>> The current rationale behind assuming a leadership loss when
>>>>>>>>> seeing a SUSPENDED connection is to assume the worst and to be on
>>>>>>>>> the safe side.
>>>>>>>>>
>>>>>>>>> Yang Wang is correct. FLINK-10052 [1] has the goal of making this
>>>>>>>>> behaviour configurable. Unfortunately, the community did not have
>>>>>>>>> enough time to complete the feature.
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-10052
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
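For reference, the SessionConnectionStateErrorPolicy mentioned above is installed on the Curator client builder. A minimal sketch of such a setup follows; the connect string is a placeholder, the retry values mirror the max-retry-attempts=3 / 5-second interval from earlier in the thread, and this is not the actual FLINK-10052 patch.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.state.SessionConnectionStateErrorPolicy;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public final class SessionPolicySketch {

        public static CuratorFramework create() {
            // With the session-based policy, only LOST (an actual session
            // expiry) counts as an error state; a bare SUSPENDED no longer
            // does.
            CuratorFramework client = CuratorFrameworkFactory.builder()
                    .connectString("zk1:2181,zk2:2181,zk3:2181") // placeholder
                    .sessionTimeoutMs(60000)
                    .retryPolicy(new ExponentialBackoffRetry(5000, 3))
                    .connectionStateErrorPolicy(
                            new SessionConnectionStateErrorPolicy())
                    .build();
            client.start();
            return client;
        }
    }

Curator's default, StandardConnectionStateErrorPolicy, treats SUSPENDED as an error state as well; the session-based policy defers the error until the session has actually expired, which matches the behavior Chen Qin reports.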
>>>>>>>>> On Tue, Apr 13, 2021 at 8:25 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> This might be related to FLINK-10052[1]. Unfortunately, we do
>>>>>>>>>> not have any progress on this ticket.
>>>>>>>>>>
>>>>>>>>>> cc @Till Rohrmann <trohrm...@apache.org>
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Yang
>>>>>>>>>>
>>>>>>>>>> chenqin <qinnc...@gmail.com> wrote on Tue, Apr 13, 2021 at 7:31 AM:
>>>>>>>>>>
>>>>>>>>>>> Hi there,
>>>>>>>>>>>
>>>>>>>>>>> We observed several jobs running on 1.11 restart due to job
>>>>>>>>>>> leader loss. Digging deeper, the issue seems related to the
>>>>>>>>>>> SUSPENDED state handler in ZooKeeperLeaderRetrievalService.
>>>>>>>>>>>
>>>>>>>>>>> AFAIK, the suspended state is expected when ZK is not certain
>>>>>>>>>>> whether the leader is still alive. It can be followed by
>>>>>>>>>>> RECONNECTED or LOST. In the current implementation [1], we
>>>>>>>>>>> treat the suspended state the same as the lost state and
>>>>>>>>>>> actively shut down the job. This poses a stability issue in
>>>>>>>>>>> large HA settings.
>>>>>>>>>>>
>>>>>>>>>>> My question is: can we get some insight into the reasoning
>>>>>>>>>>> behind this decision, and could we add some tunable
>>>>>>>>>>> configuration that lets users decide how long they can tolerate
>>>>>>>>>>> such an uncertain suspended state in their jobs?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Chen
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
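To make the SUSPENDED-vs-LOST distinction above concrete, here is a sketch of a Curator connection-state listener with the 1.11 behavior chenqin describes. notifyLeaderLoss() is a hypothetical stand-in; this is not the actual ZooKeeperLeaderRetrievalService code.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.state.ConnectionState;
    import org.apache.curator.framework.state.ConnectionStateListener;

    class LeaderConnectionListener implements ConnectionStateListener {

        @Override
        public void stateChanged(CuratorFramework client, ConnectionState newState) {
            switch (newState) {
                case SUSPENDED:
                    // Flink 1.11 treats this case like LOST and revokes
                    // leadership immediately, which is the stability issue
                    // raised in this thread. A tunable grace period would
                    // instead wait here for RECONNECTED or LOST.
                    notifyLeaderLoss();
                    break;
                case LOST:
                    // The session expired; the leader is definitely unknown.
                    notifyLeaderLoss();
                    break;
                case RECONNECTED:
                    // Reconnected within the session timeout; state intact.
                    break;
                default:
                    break;
            }
        }

        private void notifyLeaderLoss() {
            // Hypothetical stand-in: clear the cached leader address and
            // notify the retrieval listener that the leader is unknown.
        }
    }

Such a listener is registered via client.getConnectionStateListenable().addListener(...); the tunable configuration requested above would effectively delay the SUSPENDED branch rather than acting on it immediately.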