Hello Kafka Users,

We experienced an issue recently where a single Kafka broker reported that, for all of the topic-partitions for which it was leader:
    [2017-05-16 22:48:47,760] INFO Partition [<TOPIC>,<PARTITION>] on broker 561: Shrinking ISR for partition [<TOPIC>,<PARTITION>] from <B1>,<B2>,561 to 561 (kafka.cluster.Partition)

This message occurred for 128 separate topic-partition combinations in very rapid succession - the log timestamps span only 200 ms from the first of these messages to the last.

At the same time, we noticed a spike in the metrics kafka.network.RequestChannel.{RequestQueueSize,ResponseQueueSize}. The RequestQueueSize went from its usual value of 0 or 1 up to 16, and the ResponseQueueSize jumped from the typical 0 or 1 all the way to 127 (interestingly, 127 is one less than the total number of partitions, 128, for which the broker decided it was the sole ISR). Both queues stayed at this constant, elevated level until the broker was manually restarted about an hour later. (A sketch of a watchdog for these gauges is in the P.P.S. below.)

There were no signs of networking problems in dmesg or /var/log/syslog, although there was a dhclient message a couple of minutes before the incident:

    May 16 22:47:48 production-kafka-3b dhclient: DHCPREQUEST of 10.0.130.239 on eth0 to 10.0.130.1 port 67 (xid=0x40dd7b45)

But these messages are frequent (they recur approximately every 30 minutes) and have never caused Kafka problems in the past, so I doubt there is any relation.

Within the next minute after these messages, on the controller broker (a different broker from the one mentioned above), we saw:

    [2017-05-16 22:49:20,884] WARN [Controller-721-to-broker-561-send-thread], Controller 721 epoch 34 fails to send request {controller_id=721,controller_epoch=34,partition_states=[{topic=<TOPIC>,partition=34,controller_epoch=33,leader=561,leader_epoch=149,isr=[561],zk_version=329,replicas=[561,254,267]},...rack=us-east-1c}]} to broker production-kafka-3b:9092 (id: 561 rack: null). Reconnecting to broker. (kafka.controller.RequestSendThread)
    java.io.IOException: Connection to 561 was disconnected before the response was read

So the broker production-kafka-3b was having trouble syncing with its followers, according to its own logs, and the controller was also unable to communicate with it, according to the controller's logs. However, the host itself did not seem to have general networking problems - we could ssh to it without any perceptible latency. The problem therefore seems to lie somewhere in Kafka's handling of its network resources.

I found https://issues.apache.org/jira/browse/KAFKA-4477, which sounds very similar to our situation, but it is reported as fixed in 0.10.1.1 - the version we have been running for several months - and up until recently we never saw this issue. I am also investigating whether https://issues.apache.org/jira/browse/KAFKA-5153 and https://issues.apache.org/jira/browse/KAFKA-5007 are relevant to our situation.

I regret not capturing a jstack and lsof of the process before we killed it, but will try to do so next time; the P.S. below has the capture script I plan to keep handy.

Has anyone experienced this issue? Can anything be done operationally to mitigate this problem, other than restarting the affected broker?

Thanks,
Eugene
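
P.S. In case it's useful to anyone else: below is the capture script I plan to keep on the brokers so we grab diagnostics before restarting next time. It's a rough Python sketch, not a polished tool - it assumes jstack and lsof are on the PATH, that it runs as the same user as the broker (jstack won't attach otherwise), and that the broker was started the usual way via kafka-server-start.sh, so its command line contains the main class kafka.Kafka.

    #!/usr/bin/env python
    # Grab a few thread dumps plus the open-file list from the running
    # Kafka broker before we restart it.
    import subprocess
    import time

    def kafka_pid():
        # pgrep -f matches against the full command line; a stock broker's
        # command line contains the main class kafka.Kafka. Takes the
        # first match if there are several.
        return subprocess.check_output(
            ["pgrep", "-f", "kafka.Kafka"]).split()[0].decode()

    def capture(cmd, outfile):
        # Run a command and save its stdout to a file.
        with open(outfile, "wb") as f:
            f.write(subprocess.check_output(cmd))

    if __name__ == "__main__":
        pid = kafka_pid()
        stamp = time.strftime("%Y%m%dT%H%M%S")
        # Several thread dumps a few seconds apart, so stuck or blocked
        # threads stand out across the snapshots.
        for i in range(3):
            capture(["jstack", pid], "jstack-%s-%d.txt" % (stamp, i))
            time.sleep(5)
        # Open file descriptors and sockets held by the broker.
        capture(["lsof", "-p", pid], "lsof-%s.txt" % stamp)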
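
P.P.S. Since the stuck request/response queues were the clearest signature of this incident, we also want to watch those two gauges directly rather than notice them after the fact. Here's a sketch that leans on kafka.tools.JmxTool (which ships with 0.10.x). It assumes JMX is enabled on the broker at port 9999 and that KAFKA_HOME is set; the threshold of 10 is just a guess based on our normal steady state of 0 or 1, and JmxTool's CSV layout may differ slightly between versions, so treat the parsing as illustrative.

    #!/usr/bin/env python
    # Poll the broker's request/response queue gauges via JmxTool and
    # complain whenever any of them sits above a threshold.
    import os
    import subprocess

    JMX_URL = "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"  # assumes JMX on 9999
    BEANS = [
        "kafka.network:type=RequestChannel,name=RequestQueueSize",
        "kafka.network:type=RequestChannel,name=ResponseQueueSize",
    ]
    THRESHOLD = 10  # our normal steady state is 0 or 1

    cmd = [os.path.join(os.environ["KAFKA_HOME"], "bin", "kafka-run-class.sh"),
           "kafka.tools.JmxTool",
           "--jmx-url", JMX_URL,
           "--reporting-interval", "10000"]   # report every 10 s
    for bean in BEANS:
        cmd += ["--object-name", bean]

    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
    proc.stdout.readline()  # skip JmxTool's CSV header line
    for line in proc.stdout:
        # Each row is: timestamp, then one value per requested attribute.
        values = [int(float(v.strip().strip('"'))) for v in line.split(",")[1:]]
        if any(v >= THRESHOLD for v in values):
            print("queue sizes elevated: %s" % line.strip())  # hook alerting in here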