
We are running Apache Kafka v2.7.1 on a total of 54 brokers distributed
evenly across 3 racks.  All machines are identical (AWS EC2 c6g.4xlarge)
and have 32 GB of RAM, of which we dedicate 12 GB to the JVM heap.

This cluster hosts some thousands of topics (each replicated to all 3
racks) and a total of ~3,600 partitions per broker.  Except for a handful
of log-compacted topics, the retention time is <= 4 days.

Around 3 weeks ago we enabled the idempotent producer config for our
application.  Initially this didn't result in any obvious change in cluster
stability, but recently we were alerted that one of the brokers had crashed
with an OutOfMemoryError.

We took a closer look and found that the heap usage (as well as the G1
old generation usage) had been growing slowly but steadily on every broker
over the course of ~10 days.  Prior to the change it was in the range of
3-10 GB; it went up to 8-12 GB around the time the problem was detected.

As we hadn't changed anything else in the meantime, we decided to revert
the idempotent config change.  This didn't result in any immediate heap
usage change on the brokers, but now, after a few days of running with the
"new" old setup, we are starting to see the opposite trend, with heap usage
going back to the expected levels.

By examining the Kafka source code[1] we've learned that producer snapshots
(which keep the known producer IDs and other information supplied by
idempotent producers) are stored in special .snapshot files next to the log
segments.  On one of the brokers we checked, these snapshot files amounted
to 1.7 GB on disk in total.
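
For anyone who wants to take the same measurement on their own brokers,
here is a minimal sketch of how the .snapshot files can be totalled per
partition directory.  The default log directory path below is only a
placeholder (use whatever log.dirs points to), and the helper names are
ours.  Note that the on-disk size is at best a rough proxy for the
producer state held on the heap, not a direct measurement of it.

```python
import os
import sys
from collections import defaultdict


def snapshot_usage(log_dir):
    """Return {partition directory name: total bytes of *.snapshot files}."""
    usage = defaultdict(int)
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            if name.endswith(".snapshot"):
                partition = os.path.basename(root)
                usage[partition] += os.path.getsize(os.path.join(root, name))
    return dict(usage)


def summarize(usage):
    """Return (total bytes, partition count, average bytes per partition)."""
    total = sum(usage.values())
    count = len(usage)
    average = total / count if count else 0.0
    return total, count, average


if __name__ == "__main__":
    # Placeholder default; pass the real log directory as the first argument.
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/kafka/data"
    usage = snapshot_usage(log_dir)
    total, count, average = summarize(usage)
    print(f"snapshot bytes total: {total}")
    print(f"partitions with snapshots: {count}")
    print(f"average bytes per partition: {average:.0f}")
    # Show the ten partitions with the largest snapshot footprint.
    for partition, size in sorted(usage.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{size:>12}  {partition}")
```

Run as "python3 snapshot_usage.py <log dir>" on a broker.  Sorting by size
makes it easy to see whether the bulk comes from a few partitions with many
producer IDs or is spread evenly across all of them.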

Finally, here are the questions that we have:

1. Is such a dramatic increase in heap usage expected given the number of
partitions per broker?

2. Is there a way to calculate the extra heap requirements before enabling
the idempotent producer config?

3. Are there any best practices / community experience that we might be
ignoring here?

Thank you.


Kind regards,
