Hi all,
We have a single-job Flink cluster on YARN, set up with high availability.
Sometimes, after a JobManager failure, the next attempt restarts successfully
from the most recent checkpoint.
But occasionally we get the error below:
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
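A minimal sketch of how a client could recognise this response and retry while the election settles. The function names, the retry count, and the delay are hypothetical and not part of our setup; the check only looks at the JSON body shown above.

```python
import json
import time
from typing import Callable

# Substring taken from the error message quoted above.
LEADER_ELECTION_MARKER = "ongoing leader election"


def is_leader_election_error(body: str) -> bool:
    """Return True if the body is a JSON object whose "errors" list
    contains the leader-election message."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    errors = payload.get("errors", [])
    if not isinstance(errors, list):
        return False
    return any(LEADER_ELECTION_MARKER in str(e) for e in errors)


def fetch_with_retry(fetch: Callable[[], str],
                     attempts: int = 5,
                     delay_s: float = 2.0) -> str:
    """Call fetch() up to `attempts` times, waiting `delay_s` seconds
    between calls, until the body is no longer the election error.
    `fetch` is any caller-supplied function returning the response body."""
    body = fetch()
    for _ in range(attempts - 1):
        if not is_leader_election_error(body):
            break
        time.sleep(delay_s)
        body = fetch()
    return body
```

If the error is still returned after the last attempt, the election is probably not completing at all, which would point at the ZooKeeper side rather than at a slow failover.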
Hadoop version: Hadoop 2.7.1.2.4.0.0-169
Flink version: flink-1.7.2
ZooKeeper version: 3.4.6-169--1
*Below is the Flink configuration:*
high-availability: zookeeper
high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
high-availability.storageDir: hdfs:///flink/ha
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoint
state.savepoints.dir: hdfs:///flink/savepoint
jobmanager.execution.failover-strategy: region
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s
Can someone let me know whether I am missing something, or whether this is a known issue?
Could it be related to a hostname-to-IP mapping problem or to the ZooKeeper version?
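To help rule the hostname mapping question in or out, here is a minimal sketch that parses a quorum string in the same format as high-availability.zookeeper.quorum and resolves each host. The function names are hypothetical, host1..host3 are the placeholders from the configuration above, and the default port 2181 for entries without a port is an assumption.

```python
import socket
from typing import Dict, List, Tuple


def parse_quorum(quorum: str) -> List[Tuple[str, int]]:
    """Split "host1:2181,host2:2181" into [(host, port), ...].
    Entries without a port fall back to 2181 (an assumption)."""
    entries = []
    for item in quorum.split(","):
        item = item.strip()
        if not item:
            continue
        host, sep, port = item.partition(":")
        entries.append((host, int(port) if sep else 2181))
    return entries


def resolve_all(quorum: str) -> Dict[str, List[str]]:
    """Map each quorum host to the sorted IP addresses it resolves to
    on this machine; an empty list means resolution failed."""
    result = {}
    for host, port in parse_quorum(quorum):
        try:
            infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
            result[host] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            result[host] = []
    return result


if __name__ == "__main__":
    for host, addrs in resolve_all("host1:2181,host2:2181,host3:2181").items():
        print(host, "->", addrs or "NOT RESOLVABLE")
```

Running this on each node that can host the JobManager or a TaskManager and comparing the output would show whether any machine resolves a quorum host differently or not at all.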
Thanks,
Dinesh