Hi there,

We observed several jobs running on Flink 1.11 restart because the job leader was lost. Digging deeper, the issue seems to be related to the SUSPENDED state handler in ZooKeeperLeaderRetrievalService.

AFAIK, the SUSPENDED state is expected when ZooKeeper is not certain whether the leader is still alive. It can be followed by either RECONNECTED or LOST. In the current implementation [1], we treat the SUSPENDED state the same as the LOST state and actively shut down the job. This poses a stability issue in large HA setups.

My questions are: can we get some insight into the reasoning behind this decision, and could we add a tunable configuration option that lets users decide how long their jobs can tolerate such an uncertain SUSPENDED state?

Thanks,
Chen

[1] https://github.com/apache/flink/blob/release-1.11/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalService.java#L201
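
To make the request more concrete, here is a minimal sketch of the kind of behaviour being asked for. It is not Flink code and not a proposed patch: the class name SuspendedGracePeriodHandler, the gracePeriod setting, and the nested ConnectionState enum (a stand-in for the connection states reported by the ZooKeeper client) are all hypothetical and only illustrate the idea of waiting a configurable time before treating SUSPENDED as a lost leader.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch, not Flink code: SUSPENDED only counts as "leader lost"
 * if the connection does not recover within a configurable grace period.
 */
final class SuspendedGracePeriodHandler implements AutoCloseable {

    /** Stand-in for the connection states reported by the ZooKeeper client. */
    enum ConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final Duration gracePeriod;   // hypothetical user-tunable setting
    private final Runnable onLeaderLost;  // what the service does today on SUSPENDED/LOST
    private final ScheduledExecutorService timer;

    private ScheduledFuture<?> pendingLoss; // non-null while a SUSPENDED timeout is armed
    private long epoch;                     // bumped to invalidate timeouts that already started
    private boolean leaderLostNotified;     // avoids reporting the same loss twice

    SuspendedGracePeriodHandler(Duration gracePeriod, Runnable onLeaderLost) {
        this.gracePeriod = gracePeriod;
        this.onLeaderLost = onLeaderLost;
        this.timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "suspended-grace-timer");
            t.setDaemon(true); // the timer thread must not keep the JVM alive
            return t;
        });
    }

    /** Called for every connection state change. */
    synchronized void stateChanged(ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                // Uncertain state: arm a timeout instead of reacting immediately.
                if (pendingLoss == null) {
                    final long armedEpoch = epoch;
                    pendingLoss = timer.schedule(
                            () -> onGraceExpired(armedEpoch),
                            gracePeriod.toMillis(),
                            TimeUnit.MILLISECONDS);
                }
                break;
            case CONNECTED:
            case RECONNECTED:
                // Connection recovered: forget the pending timeout.
                cancelPendingLoss();
                leaderLostNotified = false;
                break;
            case LOST:
                // Definitive loss: react right away.
                cancelPendingLoss();
                notifyLeaderLost();
                break;
        }
    }

    private synchronized void onGraceExpired(long armedEpoch) {
        // Ignore timeouts that were cancelled after they had already started running.
        if (armedEpoch == epoch) {
            pendingLoss = null;
            notifyLeaderLost();
        }
    }

    private void cancelPendingLoss() {
        epoch++;
        if (pendingLoss != null) {
            pendingLoss.cancel(false);
            pendingLoss = null;
        }
    }

    private void notifyLeaderLost() {
        if (!leaderLostNotified) {
            leaderLostNotified = true;
            onLeaderLost.run();
        }
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

With a grace period of zero this sketch behaves roughly like the current handling (SUSPENDED is reported as a loss almost immediately), while a larger value lets a job ride out a short ZooKeeper hiccup as long as RECONNECTED arrives in time. Whether waiting is safe with respect to a stale leader is exactly the insight I am hoping to get from the community.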