We're running 0.10.1.0 on a five-node cluster.

I was in the process of migrating some topics from two replicas to
three replicas when two of the five machines in this cluster crashed
(brokers 2 and 3).

After restarting them, all of the topics that were previously assigned to
them are unavailable and showing "Leader: -1".

Example kafka-topics output:

% kafka-topics.sh --zookeeper $ZK_HP --describe  --unavailable-partitions
Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:
Topic: __consumer_offsets Partition: 13 Leader: -1 Replicas: 3,2,4 Isr:
Topic: __consumer_offsets Partition: 17 Leader: -1 Replicas: 3,2,5 Isr:
Topic: __consumer_offsets Partition: 23 Leader: -1 Replicas: 5,2,1 Isr:
Topic: __consumer_offsets Partition: 25 Leader: -1 Replicas: 3,2,5 Isr:
Topic: __consumer_offsets Partition: 26 Leader: -1 Replicas: 3,2,1 Isr:
Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 3,1,2 Isr:
Topic: __consumer_offsets Partition: 33 Leader: -1 Replicas: 1,2,4 Isr:
Topic: __consumer_offsets Partition: 35 Leader: -1 Replicas: 1,2,5 Isr:
Topic: __consumer_offsets Partition: 39 Leader: -1 Replicas: 3,1,2 Isr:
Topic: __consumer_offsets Partition: 40 Leader: -1 Replicas: 3,4,2 Isr:
Topic: __consumer_offsets Partition: 44 Leader: -1 Replicas: 3,1,2 Isr:
Topic: __consumer_offsets Partition: 45 Leader: -1 Replicas: 1,3,2 Isr:

Note that I wasn't even moving any of the __consumer_offsets partitions,
so I'm not sure if the fact that a reassignment was in progress is a red
herring or not.

The logs are full of

ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
server experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-3], Error for partition [tracking.syslog,2]
to broker 3:org.apache.kafka.common.errors.UnknownServerException: The
server experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-3], Error for partition
[epostg.request_log_v1,0] to broker
3:org.apache.kafka.common.errors.UnknownServerException: The server
experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
ERROR [ReplicaFetcherThread-0-3], Error for partition
[epostg.request_log_v1,0] to broker
3:org.apache.kafka.common.errors.UnknownServerException: The server
experienced an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)

What can I do to fix this? Should I manually reassign all partitions that
were led by broker 2 or 3 so that their replica set contains only the
remaining third broker? Do I need to temporarily enable unclean
elections?
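
To make the manual-reassignment option concrete, here is a rough Python
sketch of how such a reassignment file could be generated from the
describe output above. It is illustrative only: the function name
build_reassignment and the regular expression are my own assumptions, and
I don't know whether the controller would act on a reassignment like this
while the ISR is empty.

```python
import json
import re

# Matches one partition line of `kafka-topics.sh --describe` output in the
# form shown above (fields may be separated by spaces or tabs).
LINE_RE = re.compile(
    r"Topic:\s*(?P<topic>\S+)\s+Partition:\s*(?P<partition>\d+)\s+"
    r"Leader:\s*(?P<leader>-?\d+)\s+Replicas:\s*(?P<replicas>[\d,]+)"
)


def build_reassignment(describe_output, crashed_brokers):
    """Build a reassignment dict that drops crashed brokers from each replica set.

    Hypothetical helper: partitions with no surviving replica, or with no
    crashed replica, are skipped rather than guessed at.
    """
    partitions = []
    for line in describe_output.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a partition line
        replicas = [int(b) for b in match.group("replicas").split(",")]
        survivors = [b for b in replicas if b not in crashed_brokers]
        if not survivors or survivors == replicas:
            continue
        partitions.append({
            "topic": match.group("topic"),
            "partition": int(match.group("partition")),
            "replicas": survivors,
        })
    # Same JSON layout that kafka-reassign-partitions.sh reads.
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    sample = (
        "Topic: __consumer_offsets Partition: 9 Leader: -1 Replicas: 3,2,4 Isr:\n"
        "Topic: __consumer_offsets Partition: 30 Leader: -1 Replicas: 3,1,2 Isr:\n"
    )
    print(json.dumps(build_reassignment(sample, {2, 3}), indent=2))
```

The resulting JSON could presumably be saved to a file and passed to
kafka-reassign-partitions.sh with --reassignment-json-file and --execute,
but I have not tried that on this cluster.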

I've never seen a cluster fail this way...

-- 
James Brown
Engineer
