Hi,

I think that, depending on your Flink configuration (are you using high availability mode?) and the kind of ZK glitches we are talking about, it may well be that some of Flink's metadata in ZK got corrupted and the system can no longer operate. For a deeper analysis, we would need more details about your configuration and the ZK problem.
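One way to gather those details about the ZK problem is to pull the Curator connection state changes out of the JobManager log and see how the connection behaved over time. The following is only a minimal sketch, assuming the log line layout shown in your excerpt; the function name `zk_state_changes` is made up for illustration and is not part of Flink.

```python
import re

# Assumed line layout (taken from the log excerpt in this thread):
# <timestamp> <LEVEL> <logger ending in ConnectionStateManager> - State change: <STATE>
STATE_CHANGE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+\w+\s+.*?"
    r"ConnectionStateManager\s+-\s+State change: (\w+)"
)


def zk_state_changes(lines):
    """Return (timestamp, state) pairs for Curator connection state changes."""
    changes = []
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match:
            changes.append((match.group(1), match.group(2)))
    return changes


if __name__ == "__main__":
    import sys

    # Usage: python zk_states.py < jobmanager.log
    for timestamp, state in zk_state_changes(sys.stdin):
        print(timestamp, state)
```

A SUSPENDED that is soon followed by RECONNECTED points to a short connection interruption, while LOST means Curator considers the connection gone for good. Knowing which of the two you saw would already help narrow things down.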
Best,
Stefan

> On 19.01.2017 at 13:16, Andrew Ge Wu <andrew.ge...@eniro.com> wrote:
>
> Hi,
>
> We recently had several ZooKeeper glitches. When that happens, it seems to take the Flink cluster down with it.
>
> We are running on 1.03.
>
> It started like this:
>
> 2017-01-19 11:52:13,047 INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x159b505820a0008, likely server has closed socket, closing socket connection and attempting reconnect
> 2017-01-19 11:52:13,047 INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x159b505820a0009, likely server has closed socket, closing socket connection and attempting reconnect
> 2017-01-19 11:52:13,151 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
> 2017-01-19 11:52:13,151 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
> 2017-01-19 11:52:13,166 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
> 2017-01-19 11:52:13,169 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka://flink/user/jobmanager#1976923422 was revoked leadership.
> 2017-01-19 11:52:13,179 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - op1 -> (Map, Map -> op2) (18/24) (5336dd375eb12616c5a0e93c84f93465) switched from RUNNING to FAILED
>
> Then our web UI stopped serving and the job manager got stuck in an exception loop like this:
>
> 2017-01-19 13:05:13,521 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(0318ecf5-7069-41b2-a793-2f24bdbaa287,01/19/2017 13:05:13 Job execution switched to status RESTARTING.) because the expected leader session ID None did not equal the received leader session ID Some(0318ecf5-7069-41b2-a793-2f24bdbaa287).
> 2017-01-19 13:05:13,521 INFO org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy - Delaying retry of job execution for xxxxx ms …
>
> Is it because we misconfigured something, or is this expected behavior? When this happens, we have to restart the cluster to bring it back.
>
> Thanks!
>
> Andrew