Thanks for trying the unfinished PR and sharing the testing results. Glad
to here that it could work
and really hope the result of more stringent load testing.

After then I think we could revive this ticket.


Best,
Yang

Chen Qin <qinnc...@gmail.com> 于2021年4月16日周五 上午2:01写道:

> Hi there,
>
> Thanks for providing points to related changes and jira. Some updates from
> our side, we applied a path by merging FLINK-10052
> <https://issues.apache.org/jira/browse/FLINK-10052> with master as well
> as only handling lost state leveraging SessionConnectionStateErrorPolicy
> FLINK-10052 <https://issues.apache.org/jira/browse/FLINK-10052>
>  introduced.
>
> Preliminary results were good, the same workload (240 TM) on the same
> environment runs stable without frequent restarts due to suspended state
> (seems false positive). We are working on more stringent load testing as
> well as chaos testing (blocking zk). Will keep folks posted.
>
> Thanks,
> Chen
>
>
> On Tue, Apr 13, 2021 at 1:34 AM Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Hi Chenqin,
>>
>> The current rationale behind assuming a leadership loss when seeing a
>> SUSPENDED connection is to assume the worst and to be on the safe side.
>>
>> Yang Wang is correct. FLINK-10052 [1] has the goal to make the behaviour
>> configurable. Unfortunately, the community did not have enough time to
>> complete this feature.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-10052
>>
>> Cheers,
>> Till
>>
>> On Tue, Apr 13, 2021 at 8:25 AM Yang Wang <danrtsey...@gmail.com> wrote:
>>
>> > This might be related with FLINK-10052[1].
>> > Unfortunately, we do not have any progress on this ticket.
>> >
>> > cc @Till Rohrmann <trohrm...@apache.org>
>> >
>> > Best,
>> > Yang
>> >
>> > chenqin <qinnc...@gmail.com> 于2021年4月13日周二 上午7:31写道:
>> >
>> >> Hi there,
>> >>
>> >> We observed several 1.11 job running in 1.11 restart due to job leader
>> >> lost.
>> >> Dig deeper, the issue seems related to SUSPENDED state handler in
>> >> ZooKeeperLeaderRetrievalService.
>> >>
>> >> ASFAIK, suspended state is expected when zk is not certain if leader is
>> >> still alive. It can follow up with RECONNECT or LOST. In current
>> >> implementation [1] , we treat suspended state same as lost state and
>> >> actively shutdown job. This pose stability issue on large HA setting.
>> >>
>> >> My question is can we get some insight behind this decision and could
>> we
>> >> add
>> >> some tunable configuration for user to decide how long they can endure
>> >> such
>> >> uncertain suspended state in their jobs.
>> >>
>> >> Thanks,
>> >> Chen
>> >>
>> >> [1]
>> >>
>> >>
>> https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
>> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Sent from:
>> >> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
>> >>
>> >
>>
>

Reply via email to