Hello fellow Flink users,

We are experiencing an issue where Flink Job Managers are intermittently unable to resume execution of a job after a Job Manager leader election that is triggered by replacing a ZK node. This happens very infrequently and has been hard to reproduce in a QA environment because the behavior is not deterministic. The impact is that even though we observe the ZK cluster to be healthy after the node replacement, with enough synced followers and the zxid increasing via commits, the Flink cluster is unable to recover the previously running job (note: we only run one job per cluster in session mode). Restarting both Job Manager java processes resolves the issue. I estimate this happens on average once every ~5 to 10 years of application runtime, but because we run many Flink applications we see it much more frequently in production.

Our setup is:
- 1 Flink job per cluster, session mode
- Flink 1.20.1
- ZK server 3.6.2
- one ZK server per DNS entry (zookeeper-server-1, zookeeper-server-2, zookeeper-server-3, ...)
- the high-availability.zookeeper.quorum connection string lists all DNS entries
- 5 ZK nodes per cluster
- positive DNS caching is disabled at the JVM level on the Flink cluster side to prevent permanent DNS caching
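For concreteness, the relevant parts of the configuration look roughly like the sketch below, using Flink's standard ZooKeeper HA keys. Hostnames, ports, and the storage directory are illustrative placeholders, not our actual (redacted) values:

    # flink-conf.yaml (placeholder values; real paths/ports are redacted)
    high-availability.type: zookeeper
    high-availability.zookeeper.quorum: zookeeper-server-1:2181,zookeeper-server-2:2181,zookeeper-server-3:2181,zookeeper-server-4:2181,zookeeper-server-5:2181
    high-availability.storageDir: s3://<redacted-bucket>/flink/ha
    high-availability.zookeeper.path.root: /flink

    # JVM-level DNS setting on the Flink side (java.security), so that new ZK
    # node IPs are picked up after an instance replacement; 0 disables
    # positive DNS caching
    networkaddress.cache.ttl=0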
This issue occurs immediately after a ZK node is replaced with a new EC2 instance (with a new IP address). When the issue occurs, there is typically more than one ZK leader election in close succession, so the JM observes the ZK connection suspend and reconnect at least twice in total. After the ZK cluster stabilizes, the Job Managers reconnect and I see "Connection to ZooKeeper was reconnected. Leader retrieval can be restarted" in the logs. However, after the reconnection at 2025-07-21T20:14 UTC, the Job Managers are unable to settle on a single leader and/or recover the job, even though I see both Job Managers attempting to run it: "switched from state RESTARTING to RUNNING" paired with "java.lang.Exception: Job leader for job id 1042c1f0b89ae487d1806954ce14e26c lost leadership.". The job remains unscheduled until 2025-07-21T20:25, when our Kubernetes liveness probe restarts the process.

The full logs for both Job Managers (with some configuration redacted) are available here:
https://drive.google.com/file/d/1WMNE80i7DhWcCNNJUb5uQ-Dr8SPTg-zt/view?usp=drive_link
and the ZooKeeper logs are available here:
https://drive.google.com/file/d/1fux2YvQ7MxXJNojcU4iVQPzWl-ylY19r/view?usp=sharing

I plan to invest in making our ZooKeeper cluster more stable to reduce the frequency of leader elections, but the inability to recover gracefully without a process restart suggests an issue in the Flink leader election code as well. Have you experienced a similar failure mode, or do you have recommendations on next steps for our investigation?

Best,
Ben