Hello fellow Flink users,

We are experiencing an issue where Flink Job Managers are intermittently
unable to resume execution of a job after a Job Manager leader election
triggered by replacing a ZK node. This happens very infrequently and has
been challenging to reproduce in a QA environment, as the behavior is not
deterministic. The impact is that even though we observe the ZK cluster to
be healthy after the node replacement, with enough synced followers and the
zxid increasing via commits, the Flink cluster is unable to recover the
previously running job (note: we only run one job per cluster in session
mode). Restarting both Job Manager Java processes resolves the issue. I
estimate this happens on average once every ~5 to 10 years of application
runtime, but we run many Flink applications, so we see it much more
frequently in production.

Our setup is:
- 1 Flink job per cluster
- session mode
- Flink 1.20.1
- ZK server 3.6.2
- one DNS entry per ZK server
(zookeeper-server-1, zookeeper-server-2, zookeeper-server-3, ...)
- high-availability.zookeeper.quorum connection string lists all DNS entries
- 5 ZK nodes per cluster
- positive DNS caching is disabled at the JVM level on the Flink cluster
side to prevent permanent DNS caching (see the config sketch below)
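
For reference, the relevant HA and DNS settings look roughly like the
sketch below; the ZK client port, the storageDir, and the exact quorum
string are placeholders rather than our real (redacted) values:

  high-availability.type: zookeeper
  high-availability.zookeeper.quorum: zookeeper-server-1:2181,zookeeper-server-2:2181,zookeeper-server-3:2181,...
  high-availability.storageDir: s3://<redacted>/flink/ha
  # positive DNS caching disabled at the JVM level via the security
  # property, e.g. in java.security:
  # networkaddress.cache.ttl=0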

This issue occurs immediately after a ZK node is replaced with a new EC2
instance (with a new IP address). When it occurs, we typically have more
than one ZK election in close succession, causing the JM to observe the ZK
connection suspend and reconnect at least twice in total. After the ZK
cluster stabilizes, the Job Manager connects and I see "Connection to
ZooKeeper was reconnected. Leader retrieval can be restarted" in the logs.
However, after the reconnection at 2025-07-21T20:14 UTC, the Job Managers
are unable to establish a single leader and/or recover the job, even though
I do see both Job Managers trying to run the job: "switched from state
RESTARTING to RUNNING" paired with "java.lang.Exception: Job leader for job
id 1042c1f0b89ae487d1806954ce14e26c lost leadership.". The job remains
unscheduled until 2025-07-21T20:25 UTC, when our Kubernetes liveness probe
restarts the process.

The full logs for both Job Managers (with some configuration redacted) are
available here:
https://drive.google.com/file/d/1WMNE80i7DhWcCNNJUb5uQ-Dr8SPTg-zt/view?usp=drive_link
and the ZooKeeper logs are available here:
https://drive.google.com/file/d/1fux2YvQ7MxXJNojcU4iVQPzWl-ylY19r/view?usp=sharing

I plan to invest in making our ZooKeeper cluster more stable, to prevent
such frequent leader elections, but the inability to recover gracefully
without a process restart suggests an issue in the Flink leader election
code as well. Have you experienced a similar failure mode, or do you have
recommendations on next steps to continue our investigation?

Best,
Ben
