Maybe it is not ZooKeeper itself, but rather the broker's connection to ZK timed out, causing the controller to believe the broker is dead and to try electing a new leader (which doesn't exist, since you have just one node).
Increasing the ZooKeeper session timeout value may help. Also, a common cause for those timeouts is garbage collection pauses on the broker, so changing the GC policy can help. Here is the Java configuration used by LinkedIn:

-Xms4g -Xmx4g -XX:PermSize=48m -XX:MaxPermSize=48m -XX:+UseG1GC
-XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35

A rough sketch of where these settings go is included below the quoted message.

Gwen

On Mon, Aug 10, 2015 at 11:12 AM, Mike Thomsen <mikerthom...@gmail.com> wrote:
> We have a really simple Kafka setup in our development lab. It's just one
> node. Periodically, we run into this error:
>
> [2015-08-10 13:45:52,405] ERROR Controller 0 epoch 488 initiated state
> change for partition [test-data,1] from OfflinePartition to
> OnlinePartition failed (state.change.logger)
> kafka.common.NoReplicaOnlineException: No replica for partition
> [test-data,1] is alive. Live brokers are: [Set()], Assigned replicas
> are: [List(0)]
>         at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
>         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
>         at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
>         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
>         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
>
> Can anyone recommend a strategy for recovering from this? Is there such a
> thing, or do we need to build out another node or two and set the
> replication factor on our topics to cover all of the nodes that we put
> into the cluster?
>
> We have 3 ZooKeeper nodes that respond very well for other applications
> like Storm and HBase, so we're pretty confident that ZooKeeper isn't to
> blame here. Any ideas?
>
> Thanks,
>
> Mike
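In case it helps, here is a rough sketch of where those settings would go. It assumes the stock Kafka distribution layout (config/server.properties plus the bin/kafka-server-start.sh and kafka-run-class.sh scripts, which read KAFKA_HEAP_OPTS and KAFKA_JVM_PERFORMANCE_OPTS from the environment); the 30000 ms timeout is only an illustrative value, not a recommendation, so adjust names and numbers to your own deployment:

  # config/server.properties
  # Broker's ZooKeeper session timeout, in milliseconds (the default is much lower).
  zookeeper.session.timeout.ms=30000

  # Heap and GC flags, exported before starting the broker so the startup
  # scripts use them instead of their built-in defaults.
  export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"
  export KAFKA_JVM_PERFORMANCE_OPTS="-XX:PermSize=48m -XX:MaxPermSize=48m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
  bin/kafka-server-start.sh config/server.properties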