Just an update: we moved all the partitions of the one topic that
generated most of the 545 thousand messages/second onto their own set of
brokers. The old set of 9 brokers now receives only 135 thousand
messages/second, i.e. 15 thousand messages/sec/broker. We are still seeing
the same log flush time issues on these lightly loaded brokers. These
boxes have 2 SSDs of 160 GB each and 30 GB of RAM (of which 2 GB is the
Java heap). The average message size on these brokers is now around 100
bytes. Given how beefy these boxes are and how low the message throughput
is (15k messages/sec/broker at around 100 bytes each, i.e. roughly 1.5
MB/s of writes per broker), I'd think we are over-provisioned. But we
still see periodic jumps in log flush latency.
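
For reference, here's the back-of-the-envelope arithmetic behind calling
this over-provisioned (a rough sketch using the numbers above; it ignores
replication traffic and any batching/compression overhead):

public class PerBrokerWriteRate {
    public static void main(String[] args) {
        // Figures from above: 135k msgs/sec across 9 brokers, ~100 bytes/message.
        long clusterMsgsPerSec = 135_000;
        int brokers = 9;
        int avgMessageBytes = 100;

        double brokerMsgsPerSec = (double) clusterMsgsPerSec / brokers;          // ~15,000
        double brokerMBPerSec = brokerMsgsPerSec * avgMessageBytes / (1 << 20);  // ~1.4
        System.out.printf("~%.0f msgs/s and ~%.1f MB/s per broker%n",
                brokerMsgsPerSec, brokerMBPerSec);
    }
}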

Any hints on what else we might measure or check to figure this out?
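
For context, this is roughly what we could use to poll the broker-side
flush timer over JMX while watching for the spikes. It's a minimal,
untested sketch: it assumes the brokers expose JMX on port 9999 and that
the 0.8.x flush-time histogram is registered as
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs; please verify
both against your own setup (e.g. with jconsole) before relying on it.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LogFlushProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        // Assumes the broker was started with JMX remoting on port 9999.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Assumed MBean name for the flush-time histogram on 0.8.x;
            // double-check the exact name in your broker's JMX tree.
            ObjectName flush = new ObjectName(
                    "kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs");
            for (String attr : new String[] {
                    "Count", "Mean", "95thPercentile", "99thPercentile", "Max"}) {
                System.out.println(attr + " = " + mbs.getAttribute(flush, attr));
            }
        }
    }
}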


On Thu, Sep 17, 2015 at 4:39 PM, Rajiv Kurian <ra...@signalfx.com> wrote:

> We have a 9 node cluster running 0.8.2.1 that does around 545 thousand
> messages (kafka-messages-in) per second. Each of our brokers has 30 GB of
> memory and 16 cores. We give the brokers themselves a 2 GB heap. Each
> broker runs at around 33-40% CPU utilization. The values for both
> kafka-bytes-in and kafka-messages-in are pretty much the same across
> servers, which seems to suggest that there isn't much imbalance. Our
> average message size (kafka-bytes-in / kafka-messages-in) is around 80
> bytes for all brokers.
>
> The 95th percentile of the log flush times on each broker is usually 20 ms
> or lower. But periodically (maybe every 4-5 hours), for a period of about
> 30 minutes, one of the brokers has its 95th-percentile log flush latency
> go up to 600 ms or even higher. It is not always the same broker. This
> causes serious bubbles in our pipeline, and the 99th percentile of our
> producer-to-consumer latency goes from a steady state of under 300 ms to
> 20 seconds.
>
> When the log flush latency goes up, the number of under-replicated
> partitions also goes up. The CPU on the node does not show any sizable
> increase (maybe a couple of percent, absolute, if I really squint). The
> network traffic on the node goes down, probably due to back-pressure from
> the slow log flushes. It doesn't seem like the affected broker is
> receiving more messages than the others while its log flush time is
> elevated.
>
> It is quite possible that we are under-provisioned, since these periodic
> log flush bubbles started happening recently as our load went up. I don't
> see anything suspicious in the logs of the machines that are suffering
> from the higher log flush times. I want to increase capacity, but it
> would be good to confirm that this is a provisioning issue and not
> something else before I add more machines to the cluster.
>
> Thanks,
> Rajiv
>
