Hi all,

We have a single-job Flink cluster on YARN set up with high availability. Usually, after a JobManager failure, the next application attempt successfully restarts the job from the current checkpoint. Occasionally, however, we get the error below.
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

Hadoop version: 2.7.1.2.4.0.0-169
Flink version: 1.7.2
ZooKeeper version: 3.4.6-169--1

*Below is the Flink configuration*

high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s

Can someone let me know whether I am missing something, or whether this is a known issue? Could it be related to a hostname-to-IP mapping problem or to the ZooKeeper version?

Thanks,
Dinesh
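P.S. To help rule the hostname/IP mapping question in or out, here is a minimal sketch of a check that could be run on each node. It assumes Python 3 is available there; the function names `parse_quorum` and `check_resolution` are hypothetical helpers, not part of Flink or ZooKeeper. It only tests whether each host in the quorum string from the configuration above resolves to an address. It does not confirm that ZooKeeper is reachable or healthy.

```python
import socket

# Value copied from high-availability.zookeeper.quorum in the configuration above.
QUORUM = "host1:2181,host2:2181,host3:2181"


def parse_quorum(quorum):
    """Split a 'host:port,host:port' string into (host, port) tuples."""
    servers = []
    for entry in quorum.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.rpartition(":")
        if not host:
            # No port given; fall back to 2181, ZooKeeper's default client port.
            host, port = entry, "2181"
        servers.append((host, int(port)))
    return servers


def check_resolution(servers):
    """Map each host to its sorted list of IPs, or None if it does not resolve."""
    results = {}
    for host, port in servers:
        try:
            infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
            results[host] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            results[host] = None
    return results


if __name__ == "__main__":
    for host, ips in check_resolution(parse_quorum(QUORUM)).items():
        print(host, "->", ips if ips else "DOES NOT RESOLVE")
```

If a host resolves to different addresses on different nodes, or does not resolve on some of them, that would point toward the mapping problem rather than the ZooKeeper version.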