We use 0.8.2.2.

2016-05-31 20:14 GMT+08:00 Tom Crayford <tcrayf...@heroku.com>:

Is this under 0.8? There are a few known bugs in 0.8 that can lead to this
situation. I'd recommend upgrading to 0.9 as soon as is viable to prevent
this and many other kinds of issues that were fixed in 0.9.

Thanks

Tom Crayford
Heroku Kafka
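For anyone following Tom's advice: as far as I recall, the 0.9.0.0 upgrade notes describe a rolling upgrade in which the inter-broker protocol is pinned to the old version first and only bumped once every broker runs the new code. A minimal server.properties sketch, assuming the stock config layout (substitute the cluster's exact 0.8.2 patch version for the X placeholder):

    # phase 1: on each broker in turn, install the 0.9.0.0 binaries but keep
    # speaking the old inter-broker protocol until the whole cluster is upgraded
    inter.broker.protocol.version=0.8.2.X

    # phase 2: once all brokers run 0.9.0.0, switch the protocol version and
    # perform one more rolling restart
    inter.broker.protocol.version=0.9.0.0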
On Tue, May 31, 2016 at 6:19 AM, Fredo Lee <buaatianwa...@gmail.com> wrote:

Thanks for your reply.

Yes, there is more than one controller. The "soft failure" message is reported by the old controller.

2016-05-31 11:42 GMT+08:00 Muqtafi Akhmad <muqt...@traveloka.com>:

hello Fredo,

My guess is that there was a partition leader election, possibly triggered by detection of an offline partition in the Kafka cluster. Somehow the broker tried to trigger a leader election after the previous election had already completed, hence this log:

[2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 initiated state change for partition [consup-0000000043,78] from OnlinePartition to OnlinePartition failed (state.change.logger)
kafka.common.StateChangeFailedException: encountered error while electing leader for partition [consup-0000000043,78] due to: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14..

The questions are:
- was there any offline partition?
- was there more than one active controller?

CMIIW

--
Muqtafi Akhmad
Software Engineer
Traveloka
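On Muqtafi's two questions (offline partitions, more than one active controller), both can usually be checked directly against ZooKeeper and the stock CLI tools. A rough sketch, assuming a 0.8.2-era distribution and a placeholder ZooKeeper address (zk1.example.com:2181):

    # which broker currently claims the controller role, and the current epoch
    # (the active controller registers itself under the /controller znode)
    bin/zookeeper-shell.sh zk1.example.com:2181 get /controller
    bin/zookeeper-shell.sh zk1.example.com:2181 get /controller_epoch

    # list partitions that currently have no available leader
    bin/kafka-topics.sh --zookeeper zk1.example.com:2181 --describe --unavailable-partitions

If the broker id in /controller no longer matches the broker that is still logging with epoch 13, that is consistent with the "soft failure" message: the old controller kept its scheduled tasks running after losing the controllership.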
On Mon, May 30, 2016 at 2:41 PM, Fredo Lee <buaatianwa...@gmail.com> wrote:

my server.log >>>>>>>>>>>>>>>>

lots of error msgs:

[2016-05-28 12:12:31,132] WARN [ReplicaFetcherThread-0-1007], Replica 1008 for partition [consup-0000000003,16] reset its fetch offset from 13985537 to current leader 1007's latest offset 13985537 (kafka.server.ReplicaFetcherThread)
[2016-05-28 12:12:31,132] ERROR [ReplicaFetcherThread-0-1007], Current offset 13987676 for partition [consup-0000000003,16] out of range; reset offset to 13985537 (kafka.server.ReplicaFetcherThread)

the other error msg:

[2016-05-28 12:12:31,708] ERROR [Replica Manager on Broker 1008]: Error when processing fetch request for partition [consup-0000000003,35] offset 13848954 from consumer with correlation id 0. Possible cause: Request for offset 13848954 but we only have log segments in the range 12356946 to 13847167. (kafka.server.ReplicaManager)

2016-05-30 15:37 GMT+08:00 Fredo Lee <buaatianwa...@gmail.com>:

My state-change.log >>>>>>>>>>>

[2016-05-28 12:33:31,510] ERROR Controller 1008 epoch 13 aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14. (state.change.logger)
[2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1020,-1) for a request sent to broker id:1009,host:consup-kafka22.com,port:9092 (state.change.logger)
[2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1066,-1) for a request sent to broker id:1001,host:consup-kafka20.com,port:9092 (state.change.logger)
[2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(777,-1) for a request sent to broker id:1006,host:consup-kafka11.com,port:9092 (state.change.logger)
[2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(742,-1) for a request sent to broker id:1018,host:consup-kafka10.com,port:9092 (state.change.logger)
[2016-05-28 12:33:31,510] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1021,-1) for a request sent to broker id:1009,host:consup-kafka22.com,port:9092 (state.change.logger)
[2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1067,-1) for a request sent to broker id:1001,host:consup-kafka20.com,port:9092 (state.change.logger)
[2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(778,-1) for a request sent to broker id:1006,host:consup-kafka11.com,port:9092 (state.change.logger)
[2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(1022,-1) for a request sent to broker id:1009,host:consup-kafka22.com,port:9092 (state.change.logger)
[2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 encountered error while electing leader for partition [consup-0000000043,78] due to: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14.. (state.change.logger)
[2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(743,-1) for a request sent to broker id:1018,host:consup-kafka10.com,port:9092 (state.change.logger)
[2016-05-28 12:33:31,511] ERROR Controller 1008 epoch 13 initiated state change for partition [consup-0000000043,78] from OnlinePartition to OnlinePartition failed (state.change.logger)
kafka.common.StateChangeFailedException: encountered error while electing leader for partition [consup-0000000043,78] due to: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14..
        at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:380)
        at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:208)
        at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:146)
        at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:145)
        at scala.collection.immutable.Set$Set1.foreach(Set.scala:79)
        at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:145)
        at kafka.controller.KafkaController.onPreferredReplicaElection(KafkaController.scala:631)
        at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply$mcV$sp(KafkaController.scala:1158)
        at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply(KafkaController.scala:1153)
        at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17$$anonfun$apply$5.apply(KafkaController.scala:1153)
        at kafka.utils.Utils$.inLock(Utils.scala:535)
        at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17.apply(KafkaController.scala:1150)
        at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4$$anonfun$apply$17.apply(KafkaController.scala:1148)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
        at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4.apply(KafkaController.scala:1148)
        at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance$4.apply(KafkaController.scala:1127)
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at kafka.controller.KafkaController.kafka$controller$KafkaController$$checkAndTriggerPartitionRebalance(KafkaController.scala:1127)
        at kafka.controller.KafkaController$$anonfun$onControllerFailover$1.apply$mcV$sp(KafkaController.scala:326)
        at kafka.utils.KafkaScheduler$$anonfun$1.apply$mcV$sp(KafkaScheduler.scala:99)
        at kafka.utils.Utils$$anon$1.run(Utils.scala:54)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)
Caused by: kafka.common.StateChangeFailedException: aborted leader election for partition [consup-0000000043,78] since the LeaderAndIsr path was already written by another controller. This probably means that the current controller 1008 went through a soft failure and another controller was elected with epoch 14.
        at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:354)
        ... 33 more
[2016-05-28 12:33:31,511] TRACE Controller 1008 epoch 13 received response UpdateMetadataResponse(744,-1) for a request sent to broker id:1018,host:consup-kafka10.com,port:9092 (state.change.logger)
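Reading the trace above, the failed election comes out of onControllerFailover -> checkAndTriggerPartitionRebalance -> onPreferredReplicaElection, i.e. the old controller's scheduled preferred-replica rebalance fired after the epoch-14 controller had already written the LeaderAndIsr path. That periodic rebalance is driven by broker settings along these lines (names as in the 0.8.x broker config; the values shown are only illustrative, so treat this as a sketch rather than a recommendation):

    # whether the controller periodically moves leadership back to the
    # preferred replica, and how often / at what imbalance threshold it checks
    auto.leader.rebalance.enable=true
    leader.imbalance.check.interval.seconds=300
    leader.imbalance.per.broker.percentage=10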
2016-05-28 15:31 GMT+08:00 Muqtafi Akhmad <muqt...@traveloka.com>:

hello Fredo,

Can you elaborate on the "soft" failure?

--
Muqtafi Akhmad
Software Engineer
Traveloka

On Sat, May 28, 2016 at 1:53 PM, Fredo Lee <buaatianwa...@gmail.com> wrote:

We have a Kafka cluster with 19 nodes. Every week we suffer a soft failure in this cluster. How can we resolve this problem?