Re: Cluster failure after zookeeper glitch.

2017-01-20 Thread Till Rohrmann

Hi Andrew, if the ZooKeeper cluster fails and Flink is not able to connect to a functioning quorum again, then it will basically stop working because the JobManagers are no longer able to elect a leader among them. The lost leadership of the JobManager can be seen in the logs (=> expected leader s

Re: Cluster failure after zookeeper glitch.

2017-01-20 Thread Stefan Richter

I would think that network problems between Flink and ZooKeeper in HA mode could indeed lead to problems. Maybe Till (in CC) has a better idea of what is going on there. > On 19.01.2017 at 14:55, Andrew Ge Wu wrote: > > Hi Stefan > > Yes we are running in HA mode with dedicated zookeeper cl

Re: Cluster failure after zookeeper glitch.

2017-01-19 Thread Andrew Ge Wu

Hi Stefan, Yes, we are running in HA mode with a dedicated ZooKeeper cluster. As far as I can see, it looks like a networking issue with the ZooKeeper cluster. 2 out of 5 ZooKeeper servers reported something around the same time: server1 2017-01-19 11:52:13,044 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0
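
For context on the numbers in this message: a ZooKeeper ensemble stays available only while a strict majority of its servers can talk to each other. The sketch below works through that arithmetic for a five-server ensemble. It is only an illustration; the function names are hypothetical and are not part of ZooKeeper or Flink, and the message says that two servers logged warnings, not that they were actually down.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, reachable: int) -> bool:
    """True if the reachable servers can still form a majority quorum."""
    return reachable >= quorum_size(ensemble_size)


if __name__ == "__main__":
    ensemble = 5  # ensemble size mentioned in the message above
    for unreachable in range(ensemble + 1):
        reachable = ensemble - unreachable
        ok = has_quorum(ensemble, reachable)
        print(f"{unreachable} unreachable -> quorum available: {ok}")
```

Under this majority rule, a five-server ensemble tolerates two unreachable servers and loses its quorum when a third becomes unreachable.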

Re: Cluster failure after zookeeper glitch.

2017-01-19 Thread Stefan Richter

Hi, I think that, depending on your Flink configuration (are you using high availability mode?) and the type of ZK glitches we are talking about, it can very well be that some of Flink’s metadata in ZK got corrupted and the system can no longer operate. But for a deeper analysis, we would need m

