Hi, Everybody!

Every Friday night I lose the ISR for some partitions in my Kafka
cluster:

Topic: test-res-met.server_logs.conv  Partition: 18    Leader: 45    Replicas: 45,61    Isr: 45
Current controller: 45
Partitions led by broker #61 stay available; broker #61 only drops out of
the ISR of partitions whose leader is a different broker.
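
For reference, this is roughly how the same state can be read straight from
ZooKeeper (a minimal sketch using the kazoo Python library; the library
choice and the "zk1:2181" connect string are placeholders, not part of our
setup):

# Minimal sketch: read Kafka's partition-state and controller znodes from ZooKeeper.
# Assumes the kazoo library; "zk1:2181" is a placeholder connect string.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181")
zk.start()

# Leader/ISR as the controller last wrote them for this partition.
state, _ = zk.get("/brokers/topics/test-res-met.server_logs.conv/partitions/18/state")
print("partition state:", json.loads(state))   # leader, leader_epoch, isr, controller_epoch

# Which broker currently holds the controller role, and its epoch.
controller, _ = zk.get("/controller")
epoch, _ = zk.get("/controller_epoch")
print("controller:", json.loads(controller), "epoch:", int(epoch))

zk.stop()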

State-change log on broker 61:

[2015-10-30 23:02:34,012] ERROR Controller 61 epoch 2233 initiated state change of replica 61 for partition [test-res-met.server_logs.conv,18] from OnlineReplica to OfflineReplica failed (state.change.logger)
kafka.common.StateChangeFailedException: Leader and isr path written by another controller. This probably means the current controller with epoch 2233 went through a soft failure and another controller was elected with epoch 2234. Aborting state change by this controller
        at kafka.controller.KafkaController.removeReplicaFromIsr(KafkaController.scala:1002)
        at kafka.controller.ReplicaStateMachine.handleStateChange(ReplicaStateMachine.scala:250)
        at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
        at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
        at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
        at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
        at kafka.controller.ReplicaStateMachine.handleStateChanges(ReplicaStateMachine.scala:114)
        at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:451)
        at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:373)
        at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
        at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
        at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
        at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:358)
        at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
        at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
        at kafka.utils.Utils$.inLock(Utils.scala:535)
        at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:356)
        at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:568)
        at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)

Restarting the affected broker (#61) fixes it.
We have 7-day log retention (log.retention.hours=168). I have also checked
ZooKeeper and cron. Could anyone explain this issue? Kafka 0.8.2.1.
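
Since the exception points at a controller re-election (epoch 2233 -> 2234),
something like the following could be left running over a Friday night to
log controller changes as they happen (a rough sketch with kazoo again; the
connect string is a placeholder and this is not anything we run today):

# Rough sketch: log every change to Kafka's /controller and /controller_epoch
# znodes, to see whether a controller re-election happens on Friday night.
# Assumes the kazoo library; "zk1:2181" is a placeholder connect string.
import time
from kazoo.client import KazooClient
from kazoo.recipe.watchers import DataWatch

zk = KazooClient(hosts="zk1:2181")
zk.start()

def log_change(name):
    def handler(data, stat):
        ts = time.strftime("%Y-%m-%d %H:%M:%S")
        print("%s %s -> %s" % (ts, name, data))
    return handler

DataWatch(zk, "/controller", log_change("/controller"))
DataWatch(zk, "/controller_epoch", log_change("/controller_epoch"))

while True:          # keep the process alive; watches fire in kazoo's threads
    time.sleep(60)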
