Thanks, I'll give that a shot. I noticed that our configuration was using the default session and sync timeouts, so I bumped those ZooKeeper settings for Kafka as well.
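In case it helps anyone else hitting this, here's roughly what I put in server.properties. The values are just what I tried on our single-node lab box, not recommendations, and I believe the relevant keys are zookeeper.session.timeout.ms, zookeeper.connection.timeout.ms and zookeeper.sync.time.ms, but please double-check them against your broker version's docs:

    # server.properties (our dev broker)
    # default session timeout is 6000 ms; raised so short GC pauses don't expire the ZK session
    zookeeper.session.timeout.ms=30000
    # how long the broker waits for the initial connection to the ZK ensemble
    zookeeper.connection.timeout.ms=30000
    # ZK follower sync time; left near the default
    zookeeper.sync.time.ms=2000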
On Mon, Aug 10, 2015 at 4:37 PM, Gwen Shapira <g...@confluent.io> wrote:
> Maybe it is not ZooKeeper itself, but the Broker connection to ZK timed out
> and caused the controller to believe that the broker is dead and therefore
> attempted to elect a new leader (which doesn't exist, since you have just
> one node).
>
> Increasing the zookeeper session timeout value may help. Also, a common
> cause for those timeouts is garbage collection on the broker; changing GC
> policy can help.
>
> Here is the Java configuration used by LinkedIn:
>
> -Xms4g -Xmx4g -XX:PermSize=48m -XX:MaxPermSize=48m -XX:+UseG1GC
> -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35
>
> Gwen
>
> On Mon, Aug 10, 2015 at 11:12 AM, Mike Thomsen <mikerthom...@gmail.com> wrote:
> > We have a really simple Kafka setup in our development lab. It's just one
> > node. Periodically, we run into this error:
> >
> > [2015-08-10 13:45:52,405] ERROR Controller 0 epoch 488 initiated state
> > change for partition [test-data,1] from OfflinePartition to
> > OnlinePartition failed (state.change.logger)
> > kafka.common.NoReplicaOnlineException: No replica for partition
> > [test-data,1] is alive. Live brokers are: [Set()], Assigned replicas
> > are: [List(0)]
> >         at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
> >         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
> >         at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
> >         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
> >         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
> >         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
> >
> > Can anyone recommend a strategy for recovering from this? Is there such a
> > thing, or do we need to build out another node or two and set the
> > replication factor on our topics to cover all of the nodes that we put
> > into the cluster?
> >
> > We have 3 ZooKeeper nodes that respond very well for other applications
> > like Storm and HBase, so we're pretty confident that ZooKeeper isn't to
> > blame here. Any ideas?
> >
> > Thanks,
> >
> > Mike
> >
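P.S. For anyone following the thread: as far as I know, the JVM flags Gwen listed are picked up by kafka-run-class.sh from environment variables rather than from server.properties, so I set them like this before starting the broker (paths assume a stock install, adjust to yours):

    # heap sizing and GC flags, per the LinkedIn settings quoted above
    # note: the PermSize flags are Java 7 options; Java 8 ignores them with a warning
    export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g -XX:PermSize=48m -XX:MaxPermSize=48m"
    export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
    bin/kafka-server-start.sh config/server.properties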