Hi all,

We have observed that when some nodes of the ZooKeeper cluster are restarted (as part of a rolling restart), the Flink streaming jobs fail and then restart.

Log excerpt:

2017-09-22 12:54:41,426 INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15cba6e1a239774, likely server has closed socket, closing socket connection and attempting reconnect
2017-09-22 12:54:41,527 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://fl...@splat.sto.midasplayer.com:42118/user/jobmanager no longer participates in the leader election.
2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
2017-09-22 12:54:41,530 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
2017-09-22 12:54:41,530 INFO org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#-317276879 was revoked leadership.
2017-09-22 12:54:41,532 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched from state RUNNING to SUSPENDED.
java.lang.Exception: JobManager is no longer the leader.

Is this the expected behaviour?

Thanks,
Gyula