Hi there,

We observed several jobs running on Flink 1.11 restart because the job
leader was lost. Digging deeper, the issue seems related to the SUSPENDED
state handler in ZooKeeperLeaderRetrievalService.

AFAIK, the SUSPENDED state is expected when ZooKeeper is not certain whether
the leader is still alive. It can be followed by either RECONNECTED or LOST.
In the current implementation [1], we treat the SUSPENDED state the same as
the LOST state and actively shut down the job. This poses a stability issue
in large HA setups.

My question is: can we get some insight into the reasoning behind this
decision, and could we add a tunable configuration option so that users can
decide how long their jobs may tolerate such an uncertain SUSPENDED state?
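
To make the request more concrete, here is a minimal sketch of the behavior
I have in mind. It is plain Java and does not use any Flink or Curator API:
the class SuspendedGracePolicy, its State enum, and the graceMillis setting
are all hypothetical names made up for illustration. The idea is that
SUSPENDED only counts as a lost leader once a configurable grace period has
passed without a RECONNECTED event.

```java
/**
 * Hypothetical helper, NOT part of Flink: decides whether the leader should
 * be treated as lost, tolerating SUSPENDED for a configurable grace period.
 */
final class SuspendedGracePolicy {

    /** Hypothetical mirror of the connection states discussed above. */
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    // Assumed tunable, e.g. read from a (not yet existing) config option.
    private final long graceMillis;
    // Timestamp of the first SUSPENDED event, or -1 if not suspended.
    private long suspendedSince = -1L;
    private boolean lost = false;

    SuspendedGracePolicy(long graceMillis) {
        if (graceMillis < 0) {
            throw new IllegalArgumentException("graceMillis must be >= 0");
        }
        this.graceMillis = graceMillis;
    }

    /** Records a connection state change observed at time nowMillis. */
    void onStateChanged(State state, long nowMillis) {
        switch (state) {
            case SUSPENDED:
                // Keep the earliest timestamp if SUSPENDED is reported twice.
                if (suspendedSince < 0) {
                    suspendedSince = nowMillis;
                }
                break;
            case LOST:
                lost = true;
                suspendedSince = -1L;
                break;
            case CONNECTED:
            case RECONNECTED:
                // Connection is back: clear both the lost and suspended flags.
                lost = false;
                suspendedSince = -1L;
                break;
        }
    }

    /** Returns true once LOST was seen or SUSPENDED outlasted the grace period. */
    boolean shouldTreatLeaderAsLost(long nowMillis) {
        if (lost) {
            return true;
        }
        return suspendedSince >= 0 && nowMillis - suspendedSince >= graceMillis;
    }
}
```

With graceMillis set to 0 this sketch behaves like what we see today
(SUSPENDED is handled like LOST right away), so such an option could keep
the existing behavior as its default.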

Thanks,
Chen

[1]
https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201



