I'm managing a 21-node Kafka cluster running 0.8.2.1. Recently one
broker crashed during a software RAID rebuild after one of the OS disks
failed (the Kafka logs are stored on a separate RAID 1+0 array).
At first the controller behaved as expected, marking the node as down
and sending state-change requests to the remaining brokers:
[2016-11-04 15:45:03,130] INFO [Controller 20]: Shutting down broker 11
(kafka.controller.KafkaController)
[2016-11-04 15:45:03,134] DEBUG [Controller 20]: All shutting down
brokers: 11 (kafka.controller.KafkaController)
[2016-11-04 15:45:03,134] DEBUG [Controller 20]: Live brokers:
5,10,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,8,4,15
(kafka.controller.KafkaController)
[2016-11-04 15:45:03,147] DEBUG [Controller 20]: Removing replica 11
from ISR 5,11,12 for partition [topic_xx,3].
(kafka.controller.KafkaController)
That line was repeated for each partition hosted on the failed broker,
ending with [__consumer_offsets,39]:
[2016-11-04 15:45:03,447] DEBUG [Controller 20]: Removing replica 11
from ISR 12,16,11 for partition [__consumer_offsets,39].
(kafka.controller.KafkaController)
[2016-11-04 15:45:03,450] INFO [Controller 20]: New leader and ISR for
partition [__consumer_offsets,39] is
{"leader":16,"leader_epoch":258,"isr":[12,16]}
(kafka.controller.KafkaController)
[2016-11-04 15:45:03,451] DEBUG The stop replica request (delete = true)
sent to broker 11 is (kafka.controller.ControllerBrokerRequestBatch)
[2016-11-04 15:45:03,451] DEBUG The stop replica request (delete =
false) sent to broker 11 is
[Topic=__consumer_offsets,Partition=39,Replica=11]
(kafka.controller.ControllerBrokerRequestBatch)
Twelve seconds later, the broker change listener fired:
[2016-11-04 15:45:15,008] INFO [BrokerChangeListener on Controller 20]:
Broker change listener fired for path /brokers/ids with children
22,17,18,15,16,13,14,12,21,3,2,20,1,10,7,6,5,4,9,8
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
From that point on, the controller log just kept looping over these two messages:
[2016-11-04 15:45:33,780] WARN [Controller-20-to-broker-11-send-thread],
Controller 20 epoch 103 fails to send request Name: StopReplicaRequest;
Version: 0; CorrelationId: 8831; ClientId: ; DeletePartitions: false;
ControllerId: 20; ControllerEpoch: 103; Partitions:
[__consumer_offsets,39] to broker id:11,host:10.0.12.41,port:9092.
Reconnecting to broker. (kafka.controller.RequestSendThread)
[2016-11-04 15:46:03,810] INFO [Controller-20-to-broker-11-send-thread],
Controller 20 connected to id:11,host:10.0.12.41,port:9092 for sending
state change requests (kafka.controller.RequestSendThread)
While this was happening, most producers kept trying to reach the
failed node and were throwing I/O errors. About an hour later I
restarted the controller, and everything went back to normal.
I searched for known issues, and
https://issues.apache.org/jira/browse/KAFKA-2125 seems to be the
closest match to what happened, but it's marked as "Resolved" and it's
not clear what exactly was fixed, or in which version. It mentions
connection quotas, but every broker has max.connections.per.ip set to
1000, so it's unlikely that this limit was reached, and monitoring does
not show any abnormal growth in connections following the initial
broker crash.
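For reference, here is a rough way I could double-check the per-IP
connection counts independently of the monitoring tools, by tallying
ESTABLISHED entries in /proc/net/tcp on a broker (this is my own sketch,
not anything Kafka ships; Linux-only, IPv4-only, and the helper name is
made up):

```python
from collections import Counter

# Hypothetical helper: count ESTABLISHED TCP connections per remote
# IPv4 address from the text of /proc/net/tcp, as a cross-check
# against a broker's max.connections.per.ip setting.
def established_per_remote_ip(tcp_table_text):
    counts = Counter()
    for line in tcp_table_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[3] != "01":   # "01" == ESTABLISHED
            continue
        hex_ip = fields[2].split(":")[0]           # remote addr, little-endian hex
        ip = ".".join(str(int(hex_ip[i:i + 2], 16)) for i in (6, 4, 2, 0))
        counts[ip] += 1
    return counts

# Usage on a live broker would be something like:
#   established_per_remote_ip(open("/proc/net/tcp").read())
```

With max.connections.per.ip at 1000, none of the per-IP counts came
anywhere near the limit.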
Also, what I find strange is that the logs seem to indicate the
controller managed to connect to the dead broker each time, but then
timed out. I know for a fact that the broker was unreachable, so either
the log message is worded incorrectly, or the controller was in fact
connecting to something else.
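One thing I realize may explain the "connected" message: a TCP connect
can succeed as soon as the remote kernel completes the handshake, even
if no application ever services the connection. If anything on
10.0.12.41 was still answering SYNs (say, a half-up OS during the RAID
rebuild), the controller would log a successful connect and then time
out waiting for a response. A toy loopback demonstration of that
behavior (my own sketch, nothing to do with Kafka itself):

```python
import socket

# A listener whose kernel completes TCP handshakes via the listen
# backlog, while the application never calls accept() or responds.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)                      # kernel will finish handshakes itself
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(2.0)
client.connect(("127.0.0.1", port))   # "connects" fine at the TCP layer
client.sendall(b"ping")               # request sits in the kernel buffer

try:
    client.recv(1024)                 # nothing ever answers...
    timed_out = False
except socket.timeout:
    timed_out = True                  # ...so the read times out

client.close()
server.close()
```

Here connect() returns without error and the subsequent read times out,
which matches the connect/timeout pattern in the controller log. That
still leaves open what was answering SYNs on the dead host.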
Any ideas as to what happened here or where to look for more clues?