Hi,

We recently upgraded from Kafka 0.10 to 1.1, and since then we have had several occasions where some partitions in the cluster went offline and were unable to recover, with the following error:

20:33:04.702 [controller-event-thread] ERROR state.change.logger - [Controller id=1 epoch=14] Controller 1 epoch 14 failed to change state for partition __consumer_offsets-39 from OfflinePartition to OnlinePartition
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-39 under strategy PreferredReplicaPartitionLeaderElectionStrategy
    at kafka.controller.PartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:328) ~[kafka_2.11-1.1.0.jar:?]
    at kafka.controller.PartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:326) ~[kafka_2.11-1.1.0.jar:?]
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[scala-library-2.11.12.jar:?]
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) ~[scala-library-2.11.12.jar:?]
    at kafka.controller.PartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:326) ~[kafka_2.11-1.1.0.jar:?]
    at kafka.controller.PartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:254) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.PartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:175) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:116) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:604) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3$$anonfun$apply$18.apply(KafkaController.scala:1000) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3$$anonfun$apply$18.apply(KafkaController.scala:993) [kafka_2.11-1.1.0.jar:?]
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134) [scala-library-2.11.12.jar:?]
    at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134) [scala-library-2.11.12.jar:?]
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) [scala-library-2.11.12.jar:?]
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) [scala-library-2.11.12.jar:?]
    at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134) [scala-library-2.11.12.jar:?]
    at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:993) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:980) [kafka_2.11-1.1.0.jar:?]
    at scala.collection.immutable.Map$Map4.foreach(Map.scala:188) [scala-library-2.11.12.jar:?]
    at kafka.controller.KafkaController.kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance(KafkaController.scala:980) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.KafkaController$AutoPreferredReplicaLeaderElection$.process(KafkaController.scala:1014) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:69) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69) [kafka_2.11-1.1.0.jar:?]
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) [kafka_2.11-1.1.0.jar:?]
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:68) [kafka_2.11-1.1.0.jar:?]
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) [kafka_2.11-1.1.0.jar:?]

In 0.10 we were able to fix offline partitions by restarting the whole cluster. After the upgrade, we have to revert unclean.leader.election.enable to true for the restart to work.

My understanding is that unclean leader election can lose data, so my question is: is there an alternative way to fix offline partitions?

Thanks,
Di Shang