That's a very intriguing question, Dylan.

Even if the partitions for each topic are distributed evenly across
the brokers, there's no guarantee that the *data* is distributed evenly.
By default, the producer sends every message with the same key to the
same partition. If your messages are keyed and the key's cardinality is
very low, a disproportionate share of the traffic can end up on a single
"hot" partition.
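To see why that skews things, here's a rough model in Python, assuming a toy CRC32 hash in place of Kafka's actual murmur2 partitioner (the partition-per-key behavior is the same):

```python
from collections import Counter
import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    # Toy stand-in for Kafka's murmur2-based default partitioner:
    # the same key always lands on the same partition.
    return zlib.crc32(key) % num_partitions

# 10,000 keyed messages, but only three distinct keys: at most three of
# the twenty partitions can ever receive data, however many brokers exist.
keys = [b"sensor-a", b"sensor-b", b"sensor-c"]
counts = Counter(pick_partition(keys[i % 3], 20) for i in range(10_000))
print(counts)
```

With 20 partitions spread over 20 brokers, the brokers hosting those few partitions do all the work.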

One thing you could do, off the top of my head, is to take a peek at the
file access events. For example, the following one-liner shows that on this
particular node, there are a lot of writes to the `aprs` topic, partition 2:

# fatrace --seconds 10 | sort | uniq -c | sort -nr | head
   161 java(1928): W /var/lib/kafka/aprs-2/00000000000081049867.log
   155 java(1928): R /var/lib/kafka/_confluent-metrics-2/00000000000031360445.log
   148 java(1928): R /var/lib/kafka/conn-0/00000000000029833400.log
   136 ossec-agentd(1733): R /var/ossec/etc/shared/merged.mg
   129 osqueryd(2201): O /etc/passwd
   104 java(1928): R /var/lib/kafka/_confluent-monitoring-0/00000000000046052008.log
    95 osqueryd(2201): RC /etc/passwd
    91 osqueryd(2201): RCO /etc/passwd
    79 java(1928): R /var/lib/kafka/_confluent-controlcenter-5-4-0-1-MetricsAggregateStore-repartition-2/00000000000414771172.log
    64 java(1928): R /var/lib/kafka/_confluent-controlcenter-5-4-0-1-monitoring-message-rekey-store-1/00000000000002063409.log
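If the raw output gets noisy, a few lines of Python can tally write events per topic-partition directory. This is just a sketch that assumes fatrace lines shaped like the ones above, with the data dir under /var/lib/kafka:

```python
import re
from collections import Counter

# Assumed line shape: "java(1928): W /var/lib/kafka/aprs-2/....log"
LINE_RE = re.compile(r"\(\d+\): (\w+) /var/lib/kafka/([\w.-]+-\d+)/")

def tally_writes(lines):
    """Count write events per Kafka topic-partition directory."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and "W" in m.group(1):  # "W" = write in fatrace's event flags
            counts[m.group(2)] += 1
    return counts

sample = [
    "java(1928): W /var/lib/kafka/aprs-2/00000000000081049867.log",
    "java(1928): W /var/lib/kafka/aprs-2/00000000000081049867.log",
    "java(1928): R /var/lib/kafka/conn-0/00000000000029833400.log",
]
print(tally_writes(sample).most_common())  # [('aprs-2', 2)]
```

Pipe `fatrace --seconds 10` into it (or read a saved capture) to get a per-partition write leaderboard.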


I'm running CentOS 7. Here's what I did to install fatrace:

wget https://dl.fedoraproject.org/pub/fedora/linux/releases/31/Everything/source/tree/Packages/f/fatrace-0.13-5.fc31.src.rpm
rpm -i fatrace-0.13-5.fc31.src.rpm
yum install bzip2
tar xvf /root/rpmbuild/SOURCES/fatrace-0.13.tar.bz2
cd fatrace-0.13
make
make install


You could also poke around in the filesystem, perhaps using `ncdu`, to see
which topics/partitions are consuming the disk. For example, `ncdu
/var/lib/kafka` shows that partition 0 of my syslog topic is consuming most
of the space on this particular broker:

--- /var/lib/kafka -------------------
  61.1 GiB [##########] /syslog-0
   6.4 GiB [#         ] /aprs-0
   3.7 GiB [          ] /syslog-7
   3.7 GiB [          ] /syslog-9
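If ncdu isn't installed on the broker, the same shallow per-directory breakdown can be approximated with a short script. A sketch, assuming the data dir is /var/lib/kafka (adjust for your log.dirs):

```python
import os

def dir_sizes(root: str) -> dict:
    """Sum file sizes per immediate subdirectory, like a shallow `du -s */`."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for dirpath, _dirs, files in os.walk(entry.path):
                for name in files:
                    total += os.path.getsize(os.path.join(dirpath, name))
            sizes[entry.name] = total
    return sizes

# Largest topic-partition directories first.
if os.path.isdir("/var/lib/kafka"):
    for name, size in sorted(dir_sizes("/var/lib/kafka").items(),
                             key=lambda kv: kv[1], reverse=True)[:10]:
        print(f"{size / 2**30:8.1f} GiB  {name}")
```

Since every topic-partition lives in its own directory, the directory names map straight back to the partitions.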


Hopefully, someone with better Kafka-fu can suggest a more native way to
understand, at the partition level, what's causing this behavior.

HTH,

Alex Woolford

On Fri, Feb 7, 2020 at 2:38 PM Dylan Martin <dmar...@istreamplanet.com>
wrote:

> Hi all!
>
> I have a cluster of about 20 brokers and one of them is transmitting about
> 4 times as much data as the others (80 MB/sec vs 20 MB/sec).  It has
> roughly the same number of topics & partitions and it's the leader for the
> same number as all the other brokers.  The kafka-manager web tool doesn't
> say it's doing a particularly large amount of work.  Datadog & iftop both
> agree that it's sending out 4 times as much traffic as any of the others.
> It's very consistent, in that it's been this way for weeks.
>
> Any advice on how to track down what's going on?
>
> Thanks!
> -Dylan
>
