I'll look into those tools, thanks! I was able to turn on JMX polling and consumer metrics in kafka-manager, and I now know which topic and partition is causing the problem. It's basically 80MB of a single partition on a single topic being hit by 60-odd consumers. Now I need to figure out what that means.
Thanks!
-Dylan

________________________________
From: Alex Woolford <a...@woolford.io>
Sent: Saturday, February 8, 2020 10:09 PM
To: users@kafka.apache.org <users@kafka.apache.org>
Cc: Dylan Martin <dmar...@istreamplanet.com>
Subject: Re: Confusingly unbalanced broker

That's a very intriguing question, Dylan.

Even if the partitions for each of the topics are distributed evenly across the brokers, it's not guaranteed that the *data* will be distributed evenly. By default, the producer will send all the messages in a topic with the same key to the same partition. It's possible you have keyed messages, the cardinality of the key is very low, and a disproportionate portion of the messages are going to a single "hot" partition.

One thing you could do, off the top of my head, is to take a peek at the file access events. For example, the following one-liner shows that on this particular node, there are a lot of writes to the `aprs` topic, partition 2:

# fatrace --seconds 10 | sort | uniq -c | sort -nr | head
    161 java(1928): W /var/lib/kafka/aprs-2/00000000000081049867.log
    155 java(1928): R /var/lib/kafka/_confluent-metrics-2/00000000000031360445.log
    148 java(1928): R /var/lib/kafka/conn-0/00000000000029833400.log
    136 ossec-agentd(1733): R /var/ossec/etc/shared/merged.mg
    129 osqueryd(2201): O /etc/passwd
    104 java(1928): R /var/lib/kafka/_confluent-monitoring-0/00000000000046052008.log
     95 osqueryd(2201): RC /etc/passwd
     91 osqueryd(2201): RCO /etc/passwd
     79 java(1928): R /var/lib/kafka/_confluent-controlcenter-5-4-0-1-MetricsAggregateStore-repartition-2/00000000000414771172.log
     64 java(1928): R /var/lib/kafka/_confluent-controlcenter-5-4-0-1-monitoring-message-rekey-store-1/00000000000002063409.log

I'm running CentOS 7.
Here's what I did to install fatrace:

wget https://dl.fedoraproject.org/pub/fedora/linux/releases/31/Everything/source/tree/Packages/f/fatrace-0.13-5.fc31.src.rpm
rpm -i fatrace-0.13-5.fc31.src.rpm
yum install bzip2
tar xvf /root/rpmbuild/SOURCES/fatrace-0.13.tar.bz2
cd fatrace-0.13
make
make install

You could also poke around in the filesystem, perhaps using `ncdu`, to see which topics/partitions are consuming the disk. For example, `ncdu /var/lib/kafka` shows that partition 0 of my syslog topic is consuming most of the space on this particular broker:

--- /var/lib/kafka -------------------
   61.1 GiB [##########] /syslog-0
    6.4 GiB [#         ] /aprs-0
    3.7 GiB [          ] /syslog-7
    3.7 GiB [          ] /syslog-9

Hopefully, someone with better Kafka-fu can suggest a more native way to understand, at the partition level, what's causing this behavior.

HTH,

Alex Woolford

On Fri, Feb 7, 2020 at 2:38 PM Dylan Martin <dmar...@istreamplanet.com> wrote:

Hi all!

I have a cluster of about 20 brokers and one of them is transmitting about 4 times as much data as the others (80 MB/sec vs 20 MB/sec). It has roughly the same number of topics & partitions, and it's the leader for the same number as all the other brokers. The kafka-manager web tool doesn't say it's doing a particularly large amount of work. Datadog & iftop both agree that it's sending out 4 times as much traffic as any of the others. It's very consistent, in that it's been this way for weeks.

Any advice on how to track down what's going on?

Thanks!
-Dylan
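
For anyone following the thread, Alex's point about low-cardinality keys can be illustrated with a rough sketch. It only simulates the idea "same key, same partition"; it does not talk to a broker. The key names, message counts and partition count are hypothetical, and `zlib.crc32` is an assumed stand-in for the producer's real key hash (the Java client uses murmur2), so the exact partition numbers will not match a real cluster.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Assumption: crc32 stands in for the producer's key hash.
    # What matters here is only that equal keys map to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_counts(keys, num_partitions):
    # Count how many messages land on each partition.
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


NUM_PARTITIONS = 10  # hypothetical topic size

# Hypothetical workload: three distinct keys, one of which dominates.
skewed_keys = ["customer-a"] * 9000 + ["customer-b"] * 500 + ["customer-c"] * 500

# Hypothetical workload: the same message count spread over many distinct keys.
spread_keys = [f"customer-{i}" for i in range(10_000)]

skewed = partition_counts(skewed_keys, NUM_PARTITIONS)
spread = partition_counts(spread_keys, NUM_PARTITIONS)

print("low-cardinality keys: ", skewed)
print("high-cardinality keys:", spread)
```

With only three keys, at most three partitions receive any traffic and the dominant key's partition takes at least 90% of it, however many partitions the topic has. That would be consistent with one broker looking far busier than its peers even though leadership is evenly spread.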