Thanks, I'll give that a shot. I noticed that our configuration used the
default session and sync timeouts, so I increased those ZooKeeper settings
for Kafka as well.
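
The ZooKeeper-related settings I mean are the ones in the broker's
server.properties, roughly along these lines (example values only, not a
recommendation):

# example values only; the right numbers depend on your environment
zookeeper.session.timeout.ms=15000
zookeeper.sync.time.ms=4000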

On Mon, Aug 10, 2015 at 4:37 PM, Gwen Shapira <g...@confluent.io> wrote:

> Maybe it is not ZooKeeper itself, but the broker's connection to ZK timed out
> and caused the controller to believe that the broker is dead, so it
> attempted to elect a new leader (which doesn't exist, since you have just
> one node).
>
> Increasing the ZooKeeper session timeout value may help. Also, a common
> cause of those timeouts is garbage collection on the broker, so changing the
> GC policy can help.
>
> Here is the Java configuration used by LinkedIn:
>
> -Xms4g -Xmx4g -XX:PermSize=48m -XX:MaxPermSize=48m -XX:+UseG1GC
> -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35
>
>
> Gwen
>
>
> On Mon, Aug 10, 2015 at 11:12 AM, Mike Thomsen <mikerthom...@gmail.com>
> wrote:
>
> > We have a really simple Kafka setup in our development lab. It's just one
> > node. Periodically, we run into this error:
> >
> > [2015-08-10 13:45:52,405] ERROR Controller 0 epoch 488 initiated state
> > change for partition [test-data,1] from OfflinePartition to
> > OnlinePartition failed (state.change.logger)
> > kafka.common.NoReplicaOnlineException: No replica for partition
> > [test-data,1] is alive. Live brokers are: [Set()], Assigned replicas
> > are: [List(0)]
> >         at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
> >         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
> >         at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
> >         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
> >         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
> >         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
> >
> > Can anyone recommend a strategy for recovering from this? Is there such a
> > thing, or do we need to build out another node or two and set up the
> > replication factor on our topics to cover all of the nodes that we put into
> > the cluster?
> >
> > We have 3 ZooKeeper nodes that respond very well for other applications
> > like Storm and HBase, so we're pretty confident that ZooKeeper isn't to
> > blame here. Any ideas?
> > Thanks,
> >
> > Mike
> >
>
