Hi,

In general, we recommend one StreamThread per core, so 48 threads sounds excessive; I don't think a single pod would get 48 cores? So using more pods with fewer threads each might be a good first step.
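As a rough sketch of what I mean (the application id and bootstrap servers are placeholders, and it assumes the JVM sees the pod's actual CPU limit), sizing the thread count per pod could look like this:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ThreadSizingSketch {
    public static Properties baseConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-service");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        // one StreamThread per core the pod actually owns; on a recent JVM,
        // availableProcessors() reflects the container's CPU limit
        int coresPerPod = Runtime.getRuntime().availableProcessors();
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, coresPerPod);
        return props;
    }
}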

The only config that sticks out is

> - max.poll.records = 1

Not sure why you reduced it to 1; this will limit the throughput a single thread can achieve. Why does the default of 500 not work for you?
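As a minimal sketch (application id and bootstrap servers are placeholders), you could either drop the override entirely or pin it back to the consumer default of 500 via the main-consumer prefix:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class ConsumerOverrideSketch {
    public static Properties config() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "join-service");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        // either omit this line (500 is the consumer default), or set it explicitly:
        props.put(StreamsConfig.mainConsumerPrefix(ConsumerConfig.MAX_POLL_RECORDS_CONFIG), 500);
        return props;
    }
}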



However, in the end, the first thing you need to figure out is why rebalancing starts. Are heartbeats getting lost? Do instances hit `max.poll.interval.ms`? -- If threads get busier with increasing load, I could imagine that you run into thread contention, which destabilizes the system.

Next, you should inspect metrics to get an idea of how the system performs. Which metrics increase/decrease while you increase throughput, and at what point does it start to fail over?
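For example, a sketch like the one below (assuming `streams` is your running KafkaStreams instance; the name filter is only a starting point) dumps some of the relevant numbers, or you can read the same metrics via JMX:

import org.apache.kafka.streams.KafkaStreams;

public class MetricsDumpSketch {
    // print rebalance- and utilization-related metrics, e.g. the consumer's
    // "rebalance-rate-per-hour" / "last-rebalance-seconds-ago" and the stream
    // thread's "poll-ratio" / "process-ratio"
    public static void dump(KafkaStreams streams) {
        streams.metrics().forEach((name, metric) -> {
            if (name.name().contains("rebalance") || name.name().endsWith("-ratio")) {
                System.out.println(name.group() + " / " + name.name()
                        + " = " + metric.metricValue());
            }
        });
    }
}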

Both together should give you a starting point to understand what the issue could be, and which change (more KS instances, fewer threads, more memory, config changes) would help.



-Matthias

On 1/9/25 7:30 PM, Martinus Elvin wrote:
Hello,

We have a Kafka Streams service that performs a left join between two KStreams on the message key. The message size is roughly 1 KB. The left topic receives around two hundred thousand (200,000) messages per second, while the right topic receives around two thousand (2,000) messages per second.
Each topic has 96 partitions with a 3-hour retention time.
The join uses a 15-minute time window.
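Roughly, the topology is a windowed KStream-KStream left join like the sketch below (topic names, String serdes, and the value joiner are simplified placeholders, not our actual code):

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.StreamJoined;

public class JoinTopologySketch {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> left =
                builder.stream("left-topic", Consumed.with(Serdes.String(), Serdes.String()));
        KStream<String, String> right =
                builder.stream("right-topic", Consumed.with(Serdes.String(), Serdes.String()));

        left.leftJoin(
                right,
                (l, r) -> l + "|" + r,  // placeholder joiner; r is null if no right match in the window
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(15)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()))
            .to("joined-topic", Produced.with(Serdes.String(), Serdes.String()));

        return builder;
    }
}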

We have a very concerning problem: the consumer keeps rebalancing after some time, and the consumer lag can accumulate to 200 million or more. We have done a lot of tuning and testing on our service, but the problem still persists. The consumer starts rebalancing when we reach 100,000 tps or more.
Below is the latest configuration we use:
- auto.offset.reset = latest
- session.timeout.ms = 300000
- heartbeat.interval.ms = 75000
- fetch.max.wait.ms = 500
- fetch.min.bytes = 1048576
- fetch.max.bytes = 52428800
- max.partition.fetch.bytes = 1048576
- max.poll.interval.ms = 300000
- max.poll.records = 1
- request.timeout.ms = 120000
- default.api.timeout.ms = 60000
- cache.max.bytes.buffering = 104857600
- num.stream.threads = 48 (each pod)

Below is some info regarding the Kafka cluster we are using:
- 10 brokers (8-core 2.3 GHz, 16 GB RAM)
- kafka version = 3.8.0
- kafka-streams library = 3.1.2

The service runs in dockerized containers with 2 pods (we have also tried up to 4 pods). Could someone provide insight or suggest solutions so we can have a stable and fast consumer to handle this kind of message load?

Thank you!


Martinus Elvin

