We have a 9 node cluster running 0.8.2.1 that does around 545 thousand
messages (kafka-messages-in) per second. Each of our brokers has 30 GB of
memory and 16 cores. We give the brokers themselves 2 GB of heap. Each
broker runs at around 33-40% CPU utilization. The values for both
kafka-bytes-in and kafka-messages-in are pretty much the same across
servers, which suggests there isn't much imbalance. Our average message
size (kafka-bytes-in / kafka-messages-in) is around 80 bytes on all
brokers.
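
For a rough sense of scale, the back-of-the-envelope below (just the
numbers already quoted above, nothing measured beyond them) puts the
produce load at roughly 44 MB/s across the cluster, or about 5 MB/s per
broker:

    // Back-of-the-envelope write throughput from the figures above:
    // ~545k messages/s cluster-wide at ~80 bytes per message, 9 brokers.
    public class ThroughputEstimate {
        public static void main(String[] args) {
            double messagesPerSec = 545_000;   // cluster-wide kafka-messages-in
            double avgMessageBytes = 80;       // kafka-bytes-in / kafka-messages-in
            int brokers = 9;

            double clusterMBPerSec = messagesPerSec * avgMessageBytes / 1e6;
            System.out.printf("cluster:    ~%.1f MB/s%n", clusterMBPerSec);           // ~43.6 MB/s
            System.out.printf("per broker: ~%.1f MB/s%n", clusterMBPerSec / brokers); // ~4.8 MB/s
            // Replica fetches and consumer reads add I/O on top of this.
        }
    }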

The 95th percentile of the log flush times on each broker is usually 20 ms
or lower. But periodically (maybe every 4-5 hours), for a period of about
30 minutes, one of the brokers has its 95th percentile log flush latency go
up to 600 ms or even higher. It is not always the same broker. This causes
serious bubbles in our pipeline, and some of our 99th percentile
producer-to-consumer latencies jump from a steady state of under 300 ms to
20 seconds.
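
In case it helps, here is a minimal sketch of how a broker's 95th
percentile log flush time can be sampled over plain JMX. It assumes the
0.8.2-style MBean name kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
and a broker with remote JMX enabled on port 9999; the host name, port,
and exact bean name may differ on other setups:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Reads the 95th percentile log flush time of one broker over JMX.
    public class LogFlushP95 {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "broker1";  // placeholder host name
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":9999/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName flushTimer = new ObjectName(
                    "kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs");
                Object p95 = mbs.getAttribute(flushTimer, "95thPercentile");
                System.out.println(host + " log flush p95 (ms): " + p95);
            } finally {
                connector.close();
            }
        }
    }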

When the log flush latency goes up, the number of under-replicated
partitions also goes up. The CPU on the node does not show any sizable
increase (maybe a couple percent absolute if I really squint). The network
traffic on the node goes down, probably from back-pressure caused by the
slow log flush. The affected broker does not seem to be receiving more
messages than the others while its log flush time is elevated.
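
A sketch along the same lines for comparing brokers, reading each broker's
MessagesInPerSec one-minute rate and UnderReplicatedPartitions gauge (same
assumptions as above: 0.8.2-style MBean names, placeholder host names, and
remote JMX on port 9999):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Compares per-broker message rates and under-replicated partition counts,
    // to check whether the broker with the flush spike is taking more load.
    public class BrokerBalanceCheck {
        public static void main(String[] args) throws Exception {
            String[] brokers = {"broker1", "broker2", "broker3"};  // placeholder host names
            for (String host : brokers) {
                JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":9999/jmxrmi");
                JMXConnector connector = JMXConnectorFactory.connect(url);
                try {
                    MBeanServerConnection mbs = connector.getMBeanServerConnection();
                    Object msgRate = mbs.getAttribute(new ObjectName(
                        "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"),
                        "OneMinuteRate");
                    Object underReplicated = mbs.getAttribute(new ObjectName(
                        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"),
                        "Value");
                    System.out.printf("%s  messages/s: %s  under-replicated: %s%n",
                        host, msgRate, underReplicated);
                } finally {
                    connector.close();
                }
            }
        }
    }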

It is quite possible that we are under-provisioned, since these periodic
log flush bubbles started happening recently as our load went up. I don't
see anything suspicious in the logs of the machines that suffer the higher
log flush times. I want to increase capacity, but it would be good to
confirm that this is a provisioning issue and not something else before I
add more machines to the cluster.

Thanks,
Rajiv
