We're running 0.10.1.0 on a five-node cluster. I was in the process of migrating some topics from two replicas to three replicas when two of the five machines in this cluster crashed (brokers 2 and 3).
After restarting them, all of the topics that were previously assigned to them are unavailable and showing "Leader: -1".

Example kafka-topics output:

% kafka-topics.sh --zookeeper $ZK_HP --describe --unavailable-partitions
Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
Topic: __consumer_offsets Partition: 13 Leader: -1 Replicas: 3,2,4 Isr:
Topic: __consumer_offsets Partition: 17 Leader: -1 Replicas: 3,2,5 Isr:
Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
Topic: __consumer_offsets Partition: 25 Leader: -1 Replicas: 3,2,5 Isr:
Topic: __consumer_offsets Partition: 26 Leader: -1 Replicas: 3,2,1 Isr:
Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 3,1,2 Isr:
Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
Topic: __consumer_offsets Partition: 35 Leader: -1 Replicas: 1,2,5 Isr:
Topic: __consumer_offsets Partition: 39 Leader: -1 Replicas: 3,1,2 Isr:
Topic: __consumer_offsets Partition: 40 Leader: -1 Replicas: 3,4,2 Isr:
Topic: __consumer_offsets Partition: 44 Leader: -1 Replicas: 3,1,2 Isr:
Topic: __consumer_offsets Partition: 45 Leader: -1 Replicas: 1,3,2 Isr:

Note that I wasn't even moving any of the __consumer_offsets partitions, so I'm not sure if the fact that a reassignment was in progress is a red herring or not.
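To see which brokers the leaderless partitions have in common, the describe output can be tallied per broker. The following is only a rough sketch: the names parse_describe, tally_replicas and SAMPLE are made up for illustration and are not part of Kafka, and the regular expression assumes one partition per line in the "Topic: ... Partition: ... Leader: ... Replicas: ..." layout shown above.

```python
import re
from collections import Counter

# A few lines copied from the describe output above (one partition per line).
SAMPLE = """\
Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
"""

# Assumed line layout; \s+ also matches the tabs kafka-topics.sh normally emits.
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)"
)


def parse_describe(text):
    """Return one dict per partition line found in the describe output."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip prompts, blank lines, topic summary lines
        rows.append({
            "topic": match.group("topic"),
            "partition": int(match.group("partition")),
            "leader": int(match.group("leader")),
            "replicas": [int(b) for b in match.group("replicas").split(",") if b],
        })
    return rows


def tally_replicas(rows):
    """Count how often each broker id appears in leaderless replica lists."""
    counts = Counter()
    for row in rows:
        if row["leader"] == -1:
            counts.update(row["replicas"])
    return counts


rows = parse_describe(SAMPLE)
for broker, count in sorted(tally_replicas(rows).items()):
    print("broker %d appears in %d leaderless partition(s)" % (broker, count))
```

Feeding it the full output instead of SAMPLE would show whether every leaderless partition has broker 2 or 3 in its replica list.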
The logs are full of

ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2] to broker 3:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2] to broker 3:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-3], Error for partition [epostg.request_log_v1,0] to broker 3:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-3], Error for partition [epostg.request_log_v1,0] to broker 3:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)

What can I do to fix this? Should I manually reassign every partition that was led by broker 2 or 3 so that its replica set contains only whatever the third broker in its replica set was? Do I need to temporarily enable unclean elections?

I've never seen a cluster fail this way...

--
James Brown
Engineer