I'm managing a 21-node Kafka cluster running 0.8.2.1. Recently one broker crashed during a software RAID rebuild after one of the OS disks failed (Kafka logs are stored on a separate RAID 1+0 array).

At first the controller behaved as expected, marking the node as down and sending out messages to the remaining brokers:

[2016-11-04 15:45:03,130] INFO [Controller 20]: Shutting down broker 11 (kafka.controller.KafkaController)
[2016-11-04 15:45:03,134] DEBUG [Controller 20]: All shutting down brokers: 11 (kafka.controller.KafkaController)
[2016-11-04 15:45:03,134] DEBUG [Controller 20]: Live brokers: 5,10,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,8,4,15 (kafka.controller.KafkaController)
[2016-11-04 15:45:03,147] DEBUG [Controller 20]: Removing replica 11 from ISR 5,11,12 for partition [topic_xx,3]. (kafka.controller.KafkaController)

The last line was repeated for each partition hosted on the failed broker until it reached [__consumer_offsets,39]:

[2016-11-04 15:45:03,447] DEBUG [Controller 20]: Removing replica 11 from ISR 12,16,11 for partition [__consumer_offsets,39]. (kafka.controller.KafkaController)
[2016-11-04 15:45:03,450] INFO [Controller 20]: New leader and ISR for partition [__consumer_offsets,39] is {"leader":16,"leader_epoch":258,"isr":[12,16]} (kafka.controller.KafkaController)
[2016-11-04 15:45:03,451] DEBUG The stop replica request (delete = true) sent to broker 11 is (kafka.controller.ControllerBrokerRequestBatch)
[2016-11-04 15:45:03,451] DEBUG The stop replica request (delete = false) sent to broker 11 is [Topic=__consumer_offsets,Partition=39,Replica=11] (kafka.controller.ControllerBrokerRequestBatch)

Twelve seconds later, the broker change listener fired:

[2016-11-04 15:45:15,008] INFO [BrokerChangeListener on Controller 20]: Broker change listener fired for path /brokers/ids with children 22,17,18,15,16,13,14,12,21,3,2,20,1,10,7,6,5,4,9,8 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)

And then the logs just kept looping this:

[2016-11-04 15:45:33,780] WARN [Controller-20-to-broker-11-send-thread], Controller 20 epoch 103 fails to send request Name: StopReplicaRequest; Version: 0; CorrelationId: 8831; ClientId: ; DeletePartitions: false; ControllerId: 20; ControllerEpoch: 103; Partitions: [__consumer_offsets,39] to broker id:11,host:10.0.12.41,port:9092. Reconnecting to broker. (kafka.controller.RequestSendThread)
[2016-11-04 15:46:03,810] INFO [Controller-20-to-broker-11-send-thread], Controller 20 connected to id:11,host:10.0.12.41,port:9092 for sending state change requests (kafka.controller.RequestSendThread)

While this was happening, most producers were still trying to reach the failed node and throwing I/O errors. About an hour later, after restarting the controller, everything went back to normal.

I searched for known issues, and this seems to be the closest to what happened: https://issues.apache.org/jira/browse/KAFKA-2125. But it's marked as "Resolved", and it's not quite clear what was fixed and where. It mentions connection quotas, but each broker has max.connections.per.ip set to 1000, so it's unlikely that limit was reached, and our monitoring tools do not show any abnormal growth in connections following the initial broker crash.
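For anyone wanting to double-check the quota theory on their own brokers: a quick sanity check is to count established connections per peer IP on the broker's port and compare against max.connections.per.ip. Below is a minimal sketch of the counting logic; it assumes the Linux `ss -tn` column layout (State, Recv-Q, Send-Q, Local:Port, Peer:Port), and the sample lines are illustrative, not real data from this incident.

```python
from collections import Counter

def established_per_ip(ss_lines):
    """Count ESTAB connections per peer IP from `ss -tn` output lines."""
    counts = Counter()
    for line in ss_lines:
        fields = line.split()
        # ss -tn columns: State  Recv-Q  Send-Q  Local-Addr:Port  Peer-Addr:Port
        if len(fields) >= 5 and fields[0] == "ESTAB":
            peer_ip = fields[4].rsplit(":", 1)[0]
            counts[peer_ip] += 1
    return counts

# Illustrative sample standing in for live `ss -tn` output on a broker:
sample = [
    "ESTAB     0 0 10.0.12.41:9092 10.0.12.30:51124",
    "ESTAB     0 0 10.0.12.41:9092 10.0.12.30:51125",
    "ESTAB     0 0 10.0.12.41:9092 10.0.12.31:40012",
    "TIME-WAIT 0 0 10.0.12.41:9092 10.0.12.32:33001",
]
print(established_per_ip(sample))
# Counter({'10.0.12.30': 2, '10.0.12.31': 1})
```

If any single IP's count approached 1000, the quota explanation would become a lot more plausible.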

Also, what I find strange is that the logs seem to indicate the controller managed to connect to the dead broker each time, only to fail shortly after. I know for a fact that broker was unreachable, so either the log message is worded misleadingly, or it was in fact connecting to something else.
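One thing worth noting here: a successful TCP connect only means the kernel on the remote host completed the handshake; it says nothing about whether the broker process is actually serving requests. A host mid-RAID-rebuild can easily be "up" at the network level while the application is wedged. A minimal sketch demonstrating this, using a listener whose backlog completes the handshake even though the application never accepts or answers:

```python
import socket

# Listener: the kernel completes TCP handshakes into the backlog,
# but the "application" never calls accept() or sends a byte --
# analogous to a host whose network stack is up while the process is wedged.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(5)
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(1.0)
client.connect(("127.0.0.1", port))  # succeeds: handshake done by the kernel
print("connected")

try:
    client.recv(1)                   # but no application ever responds
    print("got data")
except socket.timeout:
    print("timed out waiting for a response")

client.close()
server.close()
```

This would be consistent with a "connected ... fails to send request" loop: each connect succeeds at the socket level, and each request then dies waiting for an application that never answers.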

Any ideas as to what happened here or where to look for more clues?
