I'm managing a 21-node Kafka cluster running 0.8.2.1. Recently one
broker crashed during a software RAID rebuild after one of the OS disks
failed (the Kafka logs are stored on a separate RAID 1+0 array).
At first the controller behaved as expected, marking the node as down
and sending state-change requests to the remaining brokers:
[2016-11-04 15:45:03,130] INFO [Controller 20]: Shutting down broker 11
(kafka.controller.KafkaController)
[2016-11-04 15:45:03,134] DEBUG [Controller 20]: All shutting down
brokers: 11 (kafka.controller.KafkaController)
[2016-11-04 15:45:03,134] DEBUG [Controller 20]: Live brokers:
5,10,14,20,1,6,21,9,13,2,17,22,12,7,3,18,16,8,4,15
(kafka.controller.KafkaController)
[2016-11-04 15:45:03,147] DEBUG [Controller 20]: Removing replica 11
from ISR 5,11,12 for partition [topic_xx,3].
(kafka.controller.KafkaController)
That line was repeated for each partition hosted on the failed broker,
ending with [__consumer_offsets,39]:
[2016-11-04 15:45:03,447] DEBUG [Controller 20]: Removing replica 11
from ISR 12,16,11 for partition [__consumer_offsets,39].
(kafka.controller.KafkaController)
[2016-11-04 15:45:03,450] INFO [Controller 20]: New leader and ISR for
partition [__consumer_offsets,39] is
{"leader":16,"leader_epoch":258,"isr":[12,16]}
(kafka.controller.KafkaController)
[2016-11-04 15:45:03,451] DEBUG The stop replica request (delete = true)
sent to broker 11 is (kafka.controller.ControllerBrokerRequestBatch)
[2016-11-04 15:45:03,451] DEBUG The stop replica request (delete =
false) sent to broker 11 is
[Topic=__consumer_offsets,Partition=39,Replica=11]
(kafka.controller.ControllerBrokerRequestBatch)
Twelve seconds later, the broker change listener fired:
[2016-11-04 15:45:15,008] INFO [BrokerChangeListener on Controller 20]:
Broker change listener fired for path /brokers/ids with children
22,17,18,15,16,13,14,12,21,3,2,20,1,10,7,6,5,4,9,8
(kafka.controller.ReplicaStateMachine$BrokerChangeListener)
From that point on, the controller log just kept looping over these two messages:
[2016-11-04 15:45:33,780] WARN [Controller-20-to-broker-11-send-thread],
Controller 20 epoch 103 fails to send request Name: StopReplicaRequest;
Version: 0; CorrelationId: 8831; ClientId: ; DeletePartitions: false;
ControllerId: 20; ControllerEpoch: 103; Partitions:
[__consumer_offsets,39] to broker id:11,host:10.0.12.41,port:9092.
Reconnecting to broker. (kafka.controller.RequestSendThread)
[2016-11-04 15:46:03,810] INFO [Controller-20-to-broker-11-send-thread],
Controller 20 connected to id:11,host:10.0.12.41,port:9092 for sending
state change requests (kafka.controller.RequestSendThread)
While this was happening, most producers kept trying to reach the
failed node and were throwing I/O errors. About an hour later I
restarted the controller, and everything went back to normal.
I searched for known issues, and
https://issues.apache.org/jira/browse/KAFKA-2125 seems to be the
closest match to what happened, but it's marked as "Resolved" and it's
not clear what exactly was fixed, or in which version. It mentions
connection quotas, but every broker has max.connections.per.ip set to
1000, so it's unlikely that this limit was reached, and monitoring does
not show any abnormal growth in connections following the initial
broker crash.
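For reference, here is a rough way I could double-check the per-IP
connection counts independently of the monitoring tools, by tallying
ESTABLISHED entries in /proc/net/tcp on a broker (this is my own sketch,
not anything Kafka ships; Linux-only, IPv4-only, and the helper name is
made up):

```python
from collections import Counter

# Hypothetical helper: count ESTABLISHED TCP connections per remote
# IPv4 address from the text of /proc/net/tcp, as a cross-check
# against a broker's max.connections.per.ip setting.
def established_per_remote_ip(tcp_table_text):
    counts = Counter()
    for line in tcp_table_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[3] != "01":   # "01" == ESTABLISHED
            continue
        hex_ip = fields[2].split(":")[0]           # remote addr, little-endian hex
        ip = ".".join(str(int(hex_ip[i:i + 2], 16)) for i in (6, 4, 2, 0))
        counts[ip] += 1
    return counts

# Usage on a live broker would be something like:
#   established_per_remote_ip(open("/proc/net/tcp").read())
```

With max.connections.per.ip at 1000, none of the per-IP counts came
anywhere near the limit.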
Also, what I find strange is that the logs seem to indicate the
controller managed to connect to the dead broker each time, but then
timed out. I know for a fact that the broker was unreachable, so either
the log message is worded incorrectly, or the controller was in fact
connecting to something else.
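One thing I realize may explain the "connected" message: a TCP connect
can succeed as soon as the remote kernel completes the handshake, even
if no application ever services the connection. If anything on
10.0.12.41 was still answering SYNs (say, a half-up OS during the RAID
rebuild), the controller would log a successful connect and then time
out waiting for a response. A toy loopback demonstration of that
behavior (my own sketch, nothing to do with Kafka itself):

```python
import socket

# A listener whose kernel completes TCP handshakes via the listen
# backlog, while the application never calls accept() or responds.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)                      # kernel will finish handshakes itself
port = server.getsockname()[1]

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.settimeout(2.0)
client.connect(("127.0.0.1", port))   # "connects" fine at the TCP layer
client.sendall(b"ping")               # request sits in the kernel buffer

try:
    client.recv(1024)                 # nothing ever answers...
    timed_out = False
except socket.timeout:
    timed_out = True                  # ...so the read times out

client.close()
server.close()
```

Here connect() returns without error and the subsequent read times out,
which matches the connect/timeout pattern in the controller log. That
still leaves open what was answering SYNs on the dead host.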
Any ideas as to what happened here or where to look for more clues?