Hi, everybody! Every Friday night I lose the ISR for some partitions in my Kafka cluster:
Topic: test-res-met.server_logs.conv  Partition: 18  Leader: 45  Replicas: 45,61  Isr: 45
Current controller: 45

Partitions whose leader is broker #61 stay available; I only lose broker #61 from the ISR of partitions led by other brokers. The state-change log on broker 61 shows:

[2015-10-30 23:02:34,012] ERROR Controller 61 epoch 2233 initiated state change of replica 61 for partition [test-res-met.server_logs.conv,18] from OnlineReplica to OfflineReplica failed (state.change.logger)
kafka.common.StateChangeFailedException: Leader and isr path written by another controller. This probably means the current controller with epoch 2233 went through a soft failure and another controller was elected with epoch 2234. Aborting state change by this controller
    at kafka.controller.KafkaController.removeReplicaFromIsr(KafkaController.scala:1002)
    at kafka.controller.ReplicaStateMachine.handleStateChange(ReplicaStateMachine.scala:250)
    at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
    at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
    at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
    at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
    at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
    at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
    at kafka.controller.ReplicaStateMachine.handleStateChanges(ReplicaStateMachine.scala:114)
    at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:451)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:373)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:358)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
    at kafka.utils.Utils$.inLock(Utils.scala:535)
    at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:356)
    at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:568)
    at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)

Restarting the bad broker (#61) fixes it. We have 7-day log retention (log.retention.hours=168). I have also checked ZooKeeper and cron. Could anyone explain this issue? Kafka 0.8.2.1.
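
For reference, this is roughly how I check the affected partitions and the controller state when it happens (the ZooKeeper host zk1:2181 is a placeholder; zkCli.sh should work the same way as Kafka's zookeeper-shell.sh):

    # list partitions whose ISR is smaller than the replica set
    bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions

    # show which broker currently holds the controller role, and the controller epoch
    # (to compare against epochs 2233/2234 from the error above)
    bin/zookeeper-shell.sh zk1:2181 get /controller
    bin/zookeeper-shell.sh zk1:2181 get /controller_epoch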