Maybe it is not ZooKeeper itself, but rather the broker's connection to ZK timed out, causing the controller to believe the broker is dead and to try electing a new leader (which doesn't exist, since you have just one node).
Increasing the ZooKeeper session timeout value may help. Also, a common cause for those timeouts is garbage collection pauses on the broker, so changing the GC policy can help. Here is the Java configuration used by LinkedIn:

-Xms4g -Xmx4g -XX:PermSize=48m -XX:MaxPermSize=48m -XX:+UseG1GC
-XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35

A rough sketch of where these settings go is included below the quoted message.

Gwen

On Mon, Aug 10, 2015 at 11:12 AM, Mike Thomsen <mikerthom...@gmail.com> wrote:
> We have a really simple Kafka setup in our development lab. It's just one
> node. Periodically, we run into this error:
>
> [2015-08-10 13:45:52,405] ERROR Controller 0 epoch 488 initiated state
> change for partition [test-data,1] from OfflinePartition to
> OnlinePartition failed (state.change.logger)
> kafka.common.NoReplicaOnlineException: No replica for partition
> [test-data,1] is alive. Live brokers are: [Set()], Assigned replicas
> are: [List(0)]
>         at kafka.controller.OfflinePartitionLeaderSelector.selectLeader(PartitionLeaderSelector.scala:61)
>         at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:336)
>         at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:185)
>         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:99)
>         at kafka.controller.PartitionStateMachine$$anonfun$triggerOnlinePartitionStateChange$3.apply(PartitionStateMachine.scala:96)
>         at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:743)
>
> Can anyone recommend a strategy for recovering from this? Is there such a
> thing, or do we need to build out another node or two and set the
> replication factor on our topics to cover all of the nodes that we put
> into the cluster?
>
> We have 3 ZooKeeper nodes that respond very well for other applications
> like Storm and HBase, so we're pretty confident that ZooKeeper isn't to
> blame here. Any ideas?
>
> Thanks,
>
> Mike
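In case it helps, here is a rough sketch of where those settings would go. It assumes the stock Kafka distribution layout (config/server.properties plus the bin/kafka-server-start.sh and kafka-run-class.sh scripts, which read KAFKA_HEAP_OPTS and KAFKA_JVM_PERFORMANCE_OPTS from the environment); the 30000 ms timeout is only an illustrative value, not a recommendation, so adjust names and numbers to your own deployment:

  # config/server.properties
  # Broker's ZooKeeper session timeout, in milliseconds (the default is much lower).
  zookeeper.session.timeout.ms=30000

  # Heap and GC flags, exported before starting the broker so the startup
  # scripts use them instead of their built-in defaults.
  export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"
  export KAFKA_JVM_PERFORMANCE_OPTS="-XX:PermSize=48m -XX:MaxPermSize=48m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
  bin/kafka-server-start.sh config/server.properties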