Broker 61 somehow falls behind in fetching from the leader brokers and hence falls out of the ISR.
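It might be worth checking how far behind 61 actually is and what your lag thresholds are set to before you bounce it. Something along these lines (the ZK host is a placeholder, and the property values are just the usual 0.8.x defaults as far as I remember, not recommendations):

bin/kafka-topics.sh --describe --zookeeper <zk-host>:2181 --topic test-res-met.server_logs.conv

and in server.properties on the brokers, the two settings that control when a follower is dropped from the ISR on 0.8.x:

replica.lag.time.max.ms=10000
replica.lag.max.messages=4000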
[2015-10-30 23:02:34,012] ERROR Controller 61 epoch 2233 initiated state change of replica 61 for partition [test-res-met.server_logs.conv,18] from OnlineReplica to OfflineReplica...

means that the controller with epoch 2233 went through a soft failure and came back up, but in the meantime another controller was elected with epoch 2234. The old controller will eventually resign.

Is this log from after you rebounced 61?

Thanks,

Mayuresh
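If you want to double-check which broker is currently acting as controller and what epoch it registered, you can look at the controller znodes with the ZooKeeper shell that ships with Kafka (host is a placeholder):

bin/zookeeper-shell.sh <zk-host>:2181
get /controller
get /controller_epoch

/controller holds a small JSON blob with the brokerid of the active controller, and /controller_epoch holds the current controller epoch.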
On Sat, Oct 31, 2015 at 5:09 AM, Gleb Zhukov <gzhu...@iponweb.net> wrote:

> Hi, Everybody!
>
> Every week on Friday night I lose the ISR for some partitions in my kafka
> cluster:
>
> Topic: test-res-met.server_logs.conv  Partition: 18  Leader: 45
> Replicas: 45,61  Isr: 45
> Current controller: 45
>
> Partitions with leader #61 are available; I lose broker #61 only as ISR for
> partitions with another leader.
>
> State logs on broker 61:
>
> [2015-10-30 23:02:34,012] ERROR Controller 61 epoch 2233 initiated state
> change of replica 61 for partition [test-res-met.server_logs.conv,18] from
> OnlineReplica to OfflineReplica failed (state.change.logger)
> kafka.common.StateChangeFailedException: Leader and isr path written by
> another controller. This probably means the current controller with epoch
> 2233 went through a soft failure and another controller was elected with
> epoch 2234. Aborting state change by this controller
>         at kafka.controller.KafkaController.removeReplicaFromIsr(KafkaController.scala:1002)
>         at kafka.controller.ReplicaStateMachine.handleStateChange(ReplicaStateMachine.scala:250)
>         at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
>         at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
>         at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
>         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
>         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
>         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
>         at kafka.controller.ReplicaStateMachine.handleStateChanges(ReplicaStateMachine.scala:114)
>         at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:451)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:373)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
>         at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:358)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
>         at kafka.utils.Utils$.inLock(Utils.scala:535)
>         at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:356)
>         at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:568)
>         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
>
> Restart of the bad broker (#61) helps.
> We have 7-day retention for our logs (log.retention.hours=168). I also
> checked ZK and cron. Could anyone explain this issue? Kafka 0.8.2.1.

--
-Regards,
Mayuresh R. Gharat
(862) 250-7125