We have a 9-node cluster running 0.8.2.1 that does around 545 thousand messages per second (kafka-messages-in). Each of our brokers has 30 GB of memory and 16 cores, and we give the brokers themselves 2 GB of heap. Each broker runs at around 33-40% CPU utilization. The values of both kafka-bytes-in and kafka-messages-in are pretty much the same across servers, which suggests there isn't much imbalance. Our average message size (kafka-bytes-in / kafka-messages-in) is around 80 bytes on all brokers.
The 95th percentile of the log flush times on each broker is usually 20 ms or lower. But periodically (maybe every 4-5 hours), for a period of about 30 minutes, one of the brokers has its 95th percentile log flush latency go up to 600 ms or even higher, and it is not always the same broker. This causes serious bubbles in our pipeline: the 99th percentile of our producer-to-consumer latency goes to 20 seconds from a steady state of under 300 ms. When the log flush latency goes up, the number of under-replicated partitions also goes up. The CPU on the node does not show any sizable increase (maybe a couple of percent absolute if I really squint). The network traffic on the node goes down, probably from back-pressure caused by the slow log flush. The affected broker does not seem to be receiving more messages than the others during the time its log flush time goes up.

It is quite possible that we are under-provisioned, since these periodic log flush bubbles started happening recently as our load went up. I don't see anything suspicious in the logs of the machines that are suffering from the higher log flush times. I want to increase capacity, but before I add more machines to the cluster it would be good to confirm that this is a provisioning issue and not something else.

Thanks,
Rajiv
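P.S. In case it is useful context, below is a rough sketch of the JMX poller I was planning to run against a broker to catch the next spike and correlate the flush latency percentile with under-replicated partitions. The host/port are placeholders, and I am assuming the 0.8.2-style MBean names (kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs and kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions) with the standard Yammer percentile attributes, so please treat this as a sketch rather than something verified.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FlushLatencyWatcher {
    public static void main(String[] args) throws Exception {
        // Placeholder JMX endpoint; substitute the broker host and the JMX_PORT
        // the broker was started with.
        String url = "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi";
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();

            // Assumed 0.8.2-style MBean names; verify them with jconsole first.
            ObjectName flush = new ObjectName(
                    "kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs");
            ObjectName urp = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");

            while (true) {
                // 95th percentile flush time (ms) and the under-replicated partition gauge.
                double p95 = ((Number) mbsc.getAttribute(flush, "95thPercentile")).doubleValue();
                long under = ((Number) mbsc.getAttribute(urp, "Value")).longValue();
                System.out.printf("%tT flush p95=%.1f ms, under-replicated partitions=%d%n",
                        System.currentTimeMillis(), p95, under);
                Thread.sleep(10_000L);
            }
        }
    }
}

The idea is to run one copy per broker and keep the timestamped output, so the 30-minute windows can be lined up against GC logs and iostat from the same machine.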