Issue with single job yarn flink cluster HA

Dinesh J Sun, 22 Mar 2020 00:56:46 -0700

Hi all,
We have a single-job Flink cluster on YARN set up with High Availability.
Sometimes, after a JobManager failure, the next attempt restarts successfully
from the latest checkpoint.
But occasionally we get the error below.


{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
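
To tell a short election pause from a state that never recovers, the response can be polled for a while. The following is only a minimal sketch: the URL is a placeholder for the JobManager REST address behind the YARN proxy, and the helper names (is_leader_election_error, wait_for_leader) are hypothetical, not part of Flink.

```python
import json
import time
import urllib.error
import urllib.request

# Text taken from the error message above.
ELECTION_MARKER = "ongoing leader election"


def is_leader_election_error(body: str) -> bool:
    """Return True if a response body carries the leader-election message."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    errors = payload.get("errors") or []
    if isinstance(errors, str):
        errors = [errors]
    return any(ELECTION_MARKER in str(e) for e in errors)


def wait_for_leader(url: str, attempts: int = 30, delay_s: float = 2.0) -> bool:
    """Poll url until the election message disappears; False if it never does."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = resp.read().decode("utf-8", "replace")
        except urllib.error.HTTPError as err:
            # Error statuses still carry a body worth inspecting.
            body = err.read().decode("utf-8", "replace")
        except OSError:
            # Endpoint not reachable yet; wait and retry.
            time.sleep(delay_s)
            continue
        if not is_leader_election_error(body):
            return True
        time.sleep(delay_s)
    return False


if __name__ == "__main__":
    # Placeholder address (assumption): replace with the real REST endpoint.
    print(wait_for_leader("http://localhost:8081/jobs"))
```

If this keeps returning False long after the new attempt has started, the election itself is stuck rather than just slow.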

Hadoop version: 2.7.1.2.4.0.0-169

Flink version: flink-1.7.2

ZooKeeper version: 3.4.6-169--1


*Below is the Flink configuration*

high-availability: zookeeper

high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181

high-availability.storageDir: hdfs:///flink/ha

high-availability.zookeeper.path.root: /flink

yarn.application-attempts: 10

state.backend: rocksdb

state.checkpoints.dir: hdfs:///flink/checkpoint

state.savepoints.dir: hdfs:///flink/savepoint

jobmanager.execution.failover-strategy: region

restart-strategy: failure-rate

restart-strategy.failure-rate.max-failures-per-interval: 3

restart-strategy.failure-rate.failure-rate-interval: 5 min

restart-strategy.failure-rate.delay: 10 s



Can someone let me know whether I am missing something, or whether this is a known issue?

Could it be related to a hostname-to-IP mapping issue or a ZooKeeper version issue?
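
To help rule out the hostname question, each entry of high-availability.zookeeper.quorum can be resolved and probed from the JobManager and TaskManager hosts. This is a rough sketch only: parse_quorum and check_quorum are hypothetical helper names, 2181 is assumed as the port when none is given, and a successful TCP connect does not prove that ZooKeeper itself is healthy.

```python
import socket

DEFAULT_ZK_PORT = 2181  # assumption: used only when an entry has no port


def parse_quorum(quorum: str):
    """Split 'host1:2181,host2:2181' into [(host, port), ...]."""
    pairs = []
    for entry in quorum.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if sep:
            pairs.append((host, int(port)))
        else:
            pairs.append((entry, DEFAULT_ZK_PORT))
    return pairs


def check_quorum(quorum: str, timeout_s: float = 3.0):
    """Resolve each host and try a TCP connect; return host -> status text."""
    results = {}
    for host, port in parse_quorum(quorum):
        try:
            ip = socket.gethostbyname(host)
            with socket.create_connection((ip, port), timeout=timeout_s):
                results[host] = f"ok ({ip}:{port})"
        except OSError as err:
            # Covers DNS failures as well as refused or timed-out connects.
            results[host] = f"failed: {err}"
    return results


if __name__ == "__main__":
    # Quorum string copied from the configuration above.
    for host, status in check_quorum("host1:2181,host2:2181,host3:2181").items():
        print(host, status)
```

Running it on every node makes it easy to see whether some hosts resolve the names differently from others.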

Thanks,

Dinesh
