We have a really simple Kafka setup in our development lab. It's just one node. Periodically, we run into this error:
[2015-08-10 13:45:52,405] ERROR Controller 0 epoch 488 initiated state change for partition [test-data,1] from OfflinePartition to OnlinePartition failed (state.change.logger)
kafka.common.NoReplicaOnlineException: No replica for partition [test-data,1] is alive. Live brokers are: [Set()], Assigned replicas are: [List(0)]
        at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
        at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
        at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
        at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
        at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)

Can anyone recommend a strategy for recovering from this? Is there such a thing, or do we need to build out another node or two and set the replication factor on our topics to cover all of the nodes we put into the cluster?

We have three ZooKeeper nodes that respond very well for other applications like Storm and HBase, so we're pretty confident that ZooKeeper isn't to blame here.

Any ideas?

Thanks,
Mike
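
P.S. If adding brokers is the way to go, here's the kind of command I'm assuming we'd use to recreate a topic with a higher replication factor; the ZooKeeper hostname, partition count, and replica count below are just placeholders for our environment:

    # create the topic with replicas spread across 3 brokers (placeholder values)
    bin/kafka-topics.sh --create --zookeeper zk1:2181 --replication-factor 3 --partitions 2 --topic test-data

For topics that already exist, my understanding is we'd need kafka-reassign-partitions.sh with a reassignment JSON file rather than recreating them -- does that sound right?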