Could you show the log about which unhandled exception was thrown?

Best,
tison.
Chen Qin <qinnc...@gmail.com> wrote on Fri, Apr 23, 2021 at 1:06 PM:

> Hi Tison,
>
> Please read my latest comments in the thread. Using SessionErrorPolicy
> mitigated the suspended state issue, while it might trigger an unhandled zk
> client exception in some situations. We would like to get some idea of the
> root cause of that issue to avoid introducing another issue in the fix.
>
> Chen
>
> On Thu, Apr 22, 2021 at 10:04 AM tison <wander4...@gmail.com> wrote:
>
> > > My question is: can we get some insight into this decision, and could
> > > we add some tunable configuration for users to decide how long they
> > > can tolerate such an uncertain suspended state in their jobs?
> >
> > For the specific question, Curator provides a configuration option for
> > the session timeout, and a LOST event will be generated if the
> > disconnected period lasts longer than the configured timeout.
> >
> > https://github.com/apache/flink/blob/58a7c80fa35424608ad44d1d6691d1407be0092a/flink-runtime/src/main/java/org/apache/flink/runtime/util/ZooKeeperUtils.java#L101-L102
> >
> > Best,
> > tison.
> >
> > tison <wander4...@gmail.com> wrote on Fri, Apr 23, 2021 at 12:57 AM:
> >
> > > To be concrete, if ZK is suspended and then reconnects, NodeCache
> > > already does the reset work for you, and if there is a leader epoch
> > > update, the fencing token, a.k.a. the leader session id, would be
> > > updated, so you will notice it.
> > >
> > > If ZK is permanently lost, I think it is a system-wide fault and you'd
> > > better restart the job from a checkpoint/savepoint with a working ZK
> > > ensemble.
> > >
> > > I am possibly concluding without more detailed investigation, though.
> > >
> > > Best,
> > > tison.
> > >
> > > tison <wander4...@gmail.com> wrote on Fri, Apr 23, 2021 at 12:35 AM:
> > >
> > > > > Unfortunately, we do not have any progress on this ticket.
> > > >
> > > > Here is a PR[1].
> > > >
> > > > Here is the base PR[2] I made about one year ago without any
> > > > follow-up review.
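The session-timeout behaviour tison describes — Curator only producing LOST once a disconnect outlasts the configured timeout — can be sketched as a small state tracker. This is a hypothetical, self-contained illustration of the kind of tunable tolerance window the thread asks for; the class and method names here are invented for the sketch, not Flink or Curator API:

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch (not actual Flink/Curator code): escalate a
 * SUSPENDED connection to LOST only after a tunable tolerance window,
 * instead of treating SUSPENDED as LOST immediately.
 */
public final class SuspensionTracker {
    private final long toleranceMs;   // how long a job can endure SUSPENDED
    private long suspendedAtMs = -1;  // -1 means currently connected

    public SuspensionTracker(long toleranceMs) {
        this.toleranceMs = toleranceMs;
    }

    /** Would be called when the connection listener reports SUSPENDED. */
    public void onSuspended(long nowMs) {
        if (suspendedAtMs < 0) {
            suspendedAtMs = nowMs;
        }
    }

    /** Would be called when the connection listener reports RECONNECTED. */
    public void onReconnected() {
        suspendedAtMs = -1;
    }

    /** True once the disconnect has outlasted the tolerance: treat as LOST. */
    public boolean shouldTreatAsLost(long nowMs) {
        return suspendedAtMs >= 0 && nowMs - suspendedAtMs > toleranceMs;
    }

    public static void main(String[] args) {
        SuspensionTracker tracker =
            new SuspensionTracker(TimeUnit.SECONDS.toMillis(60));
        tracker.onSuspended(0);
        System.out.println(tracker.shouldTreatAsLost(30_000));  // still within tolerance
        tracker.onReconnected();
        System.out.println(tracker.shouldTreatAsLost(120_000)); // reconnected in time
        tracker.onSuspended(130_000);
        System.out.println(tracker.shouldTreatAsLost(200_000)); // outlasted tolerance
    }
}
```

In a real integration the onSuspended/onReconnected calls would be driven by Curator's connection state callbacks, and the tolerance would be the new user-facing configuration knob.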
> > > > qinnc...@gmail.com:
> > > >
> > > > It requires further investigation about the impact involved by
> > > > FLINK-18677[3]. I do have some comments[4], but so far I regard it
> > > > as a stability problem instead of a correctness problem.
> > > >
> > > > FLINK-18677 tries to "fix" an unreasonable scenario where zk is lost
> > > > FOREVER, and I don't want to spend any time on it before there are
> > > > reactions on FLINK-10052; otherwise it is highly likely to be in
> > > > vain again, from my perspective.
> > > >
> > > > Best,
> > > > tison.
> > > >
> > > > [1] https://github.com/apache/flink/pull/15675
> > > > [2] https://github.com/apache/flink/pull/11338
> > > > [3] https://issues.apache.org/jira/browse/FLINK-18677
> > > > [4] https://github.com/apache/flink/pull/13055#discussion_r615871963
> > > >
> > > > Chen Qin <qinnc...@gmail.com> wrote on Fri, Apr 23, 2021 at 12:15 AM:
> > > >
> > > > > Hi there,
> > > > >
> > > > > Quick dial back here: we have been running load testing and so far
> > > > > haven't seen the suspended state cause job restarts.
> > > > >
> > > > > Some findings: instead of the Curator framework capturing the
> > > > > suspended state and actively notifying of leader loss, we have
> > > > > seen task managers propagate unhandled errors from the zk client,
> > > > > most likely due to
> > > > > high-availability.zookeeper.client.max-retry-attempts being set to
> > > > > 3 with a 5-second interval. It would be great if we handled this
> > > > > exception gracefully with a meaningful exception message. Those
> > > > > error messages happen when other task managers die due to user
> > > > > code exceptions; we would like more insight on this as well.
> > > > >
> > > > > For more context, Lu from our team also filed [2] describing an
> > > > > issue with 1.9; so far we haven't seen regression on the ongoing
> > > > > load testing jobs.
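The retry settings mentioned above map to Flink's ZooKeeper client options. A flink-conf.yaml fragment with what I believe are the relevant keys (the values shown are the commonly documented defaults; double-check them against your Flink version):

```yaml
# ZooKeeper client tuning for HA (values are believed defaults; verify per version)
high-availability.zookeeper.client.session-timeout: 60000    # ms
high-availability.zookeeper.client.connection-timeout: 15000 # ms
high-availability.zookeeper.client.retry-wait: 5000          # ms between retries
high-availability.zookeeper.client.max-retry-attempts: 3     # retries before giving up
```

With 3 attempts at 5-second intervals, background operations would give up after roughly 15 seconds of broken connectivity, which would be consistent with the unhandled ConnectionLossException reported in this thread.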
> > > > > Thanks,
> > > > > Chen
> > > > >
> > > > > Caused by:
> > > > > org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException:
> > > > > KeeperErrorCode = ConnectionLoss
> > > > >     at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> > > > >     at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862)
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/FLINK-10052
> > > > > [2] https://issues.apache.org/jira/browse/FLINK-19985
> > > > >
> > > > > On Thu, Apr 15, 2021 at 7:27 PM Yang Wang <danrtsey...@gmail.com> wrote:
> > > > >
> > > > > > Thanks for trying the unfinished PR and sharing the testing
> > > > > > results. Glad to hear that it could work, and we really look
> > > > > > forward to the results of the more stringent load testing.
> > > > > >
> > > > > > After that, I think we could revive this ticket.
> > > > > >
> > > > > > Best,
> > > > > > Yang
> > > > > >
> > > > > > Chen Qin <qinnc...@gmail.com> wrote on Fri, Apr 16, 2021 at 2:01 AM:
> > > > > >
> > > > > > > Hi there,
> > > > > > >
> > > > > > > Thanks for providing pointers to the related changes and JIRA.
> > > > > > > Some updates from our side: we applied a patch by merging
> > > > > > > FLINK-10052 <https://issues.apache.org/jira/browse/FLINK-10052>
> > > > > > > with master, as well as only handling the lost state by
> > > > > > > leveraging the SessionConnectionStateErrorPolicy that
> > > > > > > FLINK-10052 <https://issues.apache.org/jira/browse/FLINK-10052>
> > > > > > > introduced.
> > > > > > >
> > > > > > > Preliminary results were good: the same workload (240 TMs) in
> > > > > > > the same environment runs stably without the frequent restarts
> > > > > > > due to the suspended state (which seems to be a false
> > > > > > > positive). We are working on more stringent load testing as
> > > > > > > well as chaos testing (blocking zk).
> > > > > > > Will keep folks posted.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Chen
> > > > > > >
> > > > > > > On Tue, Apr 13, 2021 at 1:34 AM Till Rohrmann <trohrm...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi Chenqin,
> > > > > > > >
> > > > > > > > The current rationale behind assuming a leadership loss when
> > > > > > > > seeing a SUSPENDED connection is to assume the worst and to
> > > > > > > > be on the safe side.
> > > > > > > >
> > > > > > > > Yang Wang is correct. FLINK-10052 [1] has the goal of making
> > > > > > > > the behaviour configurable. Unfortunately, the community did
> > > > > > > > not have enough time to complete this feature.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-10052
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Till
> > > > > > > >
> > > > > > > > On Tue, Apr 13, 2021 at 8:25 AM Yang Wang <danrtsey...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > This might be related to FLINK-10052[1].
> > > > > > > > > Unfortunately, we do not have any progress on this ticket.
> > > > > > > > >
> > > > > > > > > cc @Till Rohrmann <trohrm...@apache.org>
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Yang
> > > > > > > > >
> > > > > > > > > chenqin <qinnc...@gmail.com> wrote on Tue, Apr 13, 2021 at 7:31 AM:
> > > > > > > > >
> > > > > > > > > > Hi there,
> > > > > > > > > >
> > > > > > > > > > We observed several jobs running on 1.11 restart due to
> > > > > > > > > > job leader loss. Digging deeper, the issue seems related
> > > > > > > > > > to the SUSPENDED state handler in
> > > > > > > > > > ZooKeeperLeaderRetrievalService.
> > > > > > > > > >
> > > > > > > > > > AFAIK, the suspended state is expected when zk is not
> > > > > > > > > > certain whether the leader is still alive. It can be
> > > > > > > > > > followed by RECONNECTED or LOST. In the current
> > > > > > > > > > implementation [1], we treat the suspended state the
> > > > > > > > > > same as the lost state and actively shut down the job.
> > > > > > > > > > This poses a stability issue on large HA setups.
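The behaviour being debated — whether SUSPENDED should count as an error at all — comes down to which connection states a policy flags as fatal. A self-contained sketch of that contrast; the enum and interface here are simplified stand-ins rather than the real Curator types, though StandardConnectionStateErrorPolicy and SessionConnectionStateErrorPolicy are the actual Curator class names referenced in the thread:

```java
/**
 * Hypothetical, self-contained model (not actual Curator classes) of the
 * difference between Curator's standard error policy and the session-based
 * one that the FLINK-10052 patch leverages: the former flags both
 * SUSPENDED and LOST as errors, the latter only LOST.
 */
public final class ErrorPolicyDemo {
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    interface ErrorPolicy {
        boolean isError(ConnState state);
    }

    /** Standard-style policy: SUSPENDED is already treated as fatal. */
    static final ErrorPolicy STANDARD =
        s -> s == ConnState.SUSPENDED || s == ConnState.LOST;

    /** Session-style policy: only LOST is fatal. */
    static final ErrorPolicy SESSION =
        s -> s == ConnState.LOST;

    public static void main(String[] args) {
        // Under the standard-style policy a transient SUSPENDED already
        // revokes leadership; under the session-style policy the job
        // rides it out until the session is truly LOST.
        System.out.println(STANDARD.isError(ConnState.SUSPENDED)); // true
        System.out.println(SESSION.isError(ConnState.SUSPENDED));  // false
        System.out.println(SESSION.isError(ConnState.LOST));       // true
    }
}
```

This makes the trade-off in the thread concrete: the session-style policy trades the worst-case safety Till describes for the stability Chen observed under load.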
> > > > > > > > > > My question is: can we get some insight into this
> > > > > > > > > > decision, and could we add some tunable configuration
> > > > > > > > > > for users to decide how long they can tolerate such an
> > > > > > > > > > uncertain suspended state in their jobs?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Chen
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > > https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Sent from:
> > > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/