Hi,
We recently had several ZooKeeper glitches, and when that happens it seems to take the Flink cluster down with it. We are running on 1.03.

It started like this:

2017-01-19 11:52:13,047 INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x159b505820a0008, likely server has closed socket, closing socket connection and attempting reconnect
2017-01-19 11:52:13,047 INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x159b505820a0009, likely server has closed socket, closing socket connection and attempting reconnect
2017-01-19 11:52:13,151 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2017-01-19 11:52:13,151 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2017-01-19 11:52:13,166 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
2017-01-19 11:52:13,169 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka://flink/user/jobmanager#1976923422 was revoked leadership.
2017-01-19 11:52:13,179 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - op1 -> (Map, Map -> op2) (18/24) (5336dd375eb12616c5a0e93c84f93465) switched from RUNNING to FAILED

Then our web UI stopped serving and the job manager got stuck in an exception loop like this:

2017-01-19 13:05:13,521 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(0318ecf5-7069-41b2-a793-2f24bdbaa287,01/19/2017 13:05:13 Job execution switched to status RESTARTING.) because the expected leader session ID None did not equal the received leader session ID Some(0318ecf5-7069-41b2-a793-2f24bdbaa287).
2017-01-19 13:05:13,521 INFO org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy - Delaying retry of job execution for xxxxx ms …

Is this because we misconfigured something, or is this expected behavior? When this happens, we have to restart the cluster to bring it back.

Thanks!
Andrew