That's a very intriguing question, Dylan. Even if the partitions for each of the topics are distributed evenly across the brokers, it's not guaranteed that the *data* will be distributed evenly. By default, the producer sends all messages with the same key to the same partition. If your messages are keyed and the key cardinality is very low, a disproportionate share of the traffic can end up on a single "hot" partition.
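To make that concrete, here's a minimal Java sketch of the arithmetic the default partitioner applies to records with a non-null key (murmur2 hash of the key bytes, modulo the partition count). The partition count of 12 and the key names are made-up placeholders for illustration; it only needs kafka-clients on the classpath:

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.utils.Utils;

    public class HotPartitionSketch {
        public static void main(String[] args) {
            int numPartitions = 12;  // hypothetical partition count, for illustration
            // Only two distinct keys: every record with a given key hashes to one
            // partition, so two partitions take all the traffic while the rest sit idle.
            for (String key : new String[] {"sensor-a", "sensor-b"}) {
                byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
                // Same formula Kafka's default partitioner uses for non-null keys
                int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
                System.out.printf("key=%s -> partition %d%n", key, partition);
            }
        }
    }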
One thing you could do, off the top of my head, is take a peek at the file access events. fatrace reports file accesses system-wide, so piping it through `sort | uniq -c | sort -nr` ranks the most frequently touched files. For example, the following one-liner shows that on this particular node there are a lot of writes to the `aprs` topic, partition 2:

    # fatrace --seconds 10 | sort | uniq -c | sort -nr | head
        161 java(1928): W /var/lib/kafka/aprs-2/00000000000081049867.log
        155 java(1928): R /var/lib/kafka/_confluent-metrics-2/00000000000031360445.log
        148 java(1928): R /var/lib/kafka/conn-0/00000000000029833400.log
        136 ossec-agentd(1733): R /var/ossec/etc/shared/merged.mg
        129 osqueryd(2201): O /etc/passwd
        104 java(1928): R /var/lib/kafka/_confluent-monitoring-0/00000000000046052008.log
         95 osqueryd(2201): RC /etc/passwd
         91 osqueryd(2201): RCO /etc/passwd
         79 java(1928): R /var/lib/kafka/_confluent-controlcenter-5-4-0-1-MetricsAggregateStore-repartition-2/00000000000414771172.log
         64 java(1928): R /var/lib/kafka/_confluent-controlcenter-5-4-0-1-monitoring-message-rekey-store-1/00000000000002063409.log

I'm running CentOS 7. Here's what I did to install fatrace, building it from the Fedora source RPM:

    wget https://dl.fedoraproject.org/pub/fedora/linux/releases/31/Everything/source/tree/Packages/f/fatrace-0.13-5.fc31.src.rpm
    rpm -i fatrace-0.13-5.fc31.src.rpm
    yum install bzip2
    tar xvf /root/rpmbuild/SOURCES/fatrace-0.13.tar.bz2
    cd fatrace-0.13
    make
    make install

You could also poke around in the filesystem, perhaps using `ncdu`, to see which topics/partitions are consuming the disk. For example, `ncdu /var/lib/kafka` shows that partition 0 of my syslog topic is consuming most of the space on this particular broker:

    --- /var/lib/kafka -------------------
       61.1 GiB [##########] /syslog-0
        6.4 GiB [#         ] /aprs-0
        3.7 GiB [          ] /syslog-7
        3.7 GiB [          ] /syslog-9

Hopefully, someone with better Kafka-fu can suggest a more native way to understand, at the partition level, what's causing this behavior (one candidate is sketched in the P.S. below, after the quoted message).

HTH,

Alex Woolford

On Fri, Feb 7, 2020 at 2:38 PM Dylan Martin <dmar...@istreamplanet.com> wrote:

> Hi all!
>
> I have a cluster of about 20 brokers and one of them is transmitting about
> 4 times as much data as the others (80 MB/sec vs 20 MB/sec). It has
> roughly the same number of topics & partitions and it's the leader for the
> same number as all the other brokers. The kafka-manager web tool doesn't
> say it's doing a particularly large amount of work. Datadog & iftop both
> agree that it's sending out 4 times as much traffic as any of the others.
> It's very consistent, in that it's been this way for weeks.
>
> Any advice on how to track down what's going on?
>
> Thanks!
> -Dylan
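P.S. Re: "a more native way": I believe the kafka-log-dirs tool that ships with Kafka (added in 1.0 via KIP-113; the script is kafka-log-dirs.sh in a plain Apache tarball) can report per-partition log sizes straight from the brokers. A rough sketch, where localhost:9092 and the topic list are placeholders you'd swap for your own cluster:

    kafka-log-dirs --bootstrap-server localhost:9092 --describe --topic-list syslog,aprs

It prints JSON that includes a size, in bytes, for each partition in each log directory on each broker, so you can spot oversized partitions without touching the filesystem. I haven't rerun my examples above with it, so treat this as a pointer rather than tested output.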