Re: Cluster failure after zookeeper glitch.

2017-01-20 Thread Till Rohrmann
Hi Andrew, if the ZooKeeper cluster fails and Flink is not able to connect to a functioning quorum again, then it will basically stop working because the JobManagers are no longer able to elect a leader among them. The lost leadership of the JobManager can be seen in the logs (=> expected leader s

Re: Cluster failure after zookeeper glitch.

2017-01-20 Thread Stefan Richter
I would think that network problems between Flink and ZooKeeper in HA mode could indeed lead to problems. Maybe Till (in CC) has a better idea of what is going on there. > On 19.01.2017 at 14:55, Andrew Ge Wu wrote: > > Hi Stefan > > Yes we are running in HA mode with dedicated zookeeper cl

Re: Cluster failure after zookeeper glitch.

2017-01-19 Thread Andrew Ge Wu
Hi Stefan, yes, we are running in HA mode with a dedicated ZooKeeper cluster. As far as I can see, it looks like a networking issue with the ZooKeeper cluster. 2 out of 5 ZooKeeper servers reported something around the same time: server1 2017-01-19 11:52:13,044 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0

Re: Cluster failure after zookeeper glitch.

2017-01-19 Thread Stefan Richter
Hi, I think that, depending on your configuration of Flink (are you using high availability mode?) and the type of ZK glitches we are talking about, it can very well be that some of Flink’s metadata in ZK got corrupted and the system can no longer operate. But for a deeper analysis, we would need m