Hi Vishal,

thanks for the detailed description of the problems.

1. This is currently the intended behaviour of Flink. The reason is that if the system is no longer connected to ZooKeeper, we cannot rule out that another process has taken over the leadership. FLINK-10052 aims to make this behaviour configurable, and we intend to include it in the next major release.

2. This is indeed a bug in the newly introduced application mode. It should be fixed in Flink 1.11.3 and 1.12.0, so I would recommend upgrading your Flink cluster.

3. It is hard to tell what the problem is here. From Flink's perspective, if it cannot establish a connection to ZooKeeper, it cannot be sure who the leader is or whether it should start executing jobs. Maybe there is a problem with the connection to the ZooKeeper cluster from the nodes on which Flink runs. If it is a network issue, decreasing the session timeouts usually makes the connection less stable.

Cheers,
Till

On Mon, Dec 21, 2020 at 3:53 PM vishalovercome <vis...@moengage.com> wrote:

> I don't know how to reproduce it but what I've observed are three kinds of
> termination when connectivity with zookeeper is somehow disrupted. I don't
> think its an issue with zookeeper as it supports a much bigger kafka cluster
> since a few years.
>
> 1. The first kind is exactly this -
> https://github.com/apache/flink/pull/11338. Basically temporary loss of
> connectivity or rolling upgrade of zookeeper will cause job to terminate. It
> will restart eventually from where it left off.
> 2. The second kind is when job terminates and restarts for the same reason
> but is unable to recover from checkpoint. I think its similar to this -
> https://issues.apache.org/jira/browse/FLINK-19154. If upgrading to 1.12.0
> (from 1.11.2) will fix the second issue then I'll upgrade.
> 3. The third kind is where it repeatedly restarts as its unable to establish
> a session with Zookeeper. I don't know if reducing session timeout will help
> here but in this case, I'm forced to disable zookeeper HA entirely as the
> job cannot even restart here.
>
> I could create a JIRA ticket for discussion zookeeper itself if you suggest
> but the issue of zookeeper and savepoints are related as I'm not sure what
> will happen in each of the above.
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/