Hello,
We have a kafka streams service that performs a left join (KStreams)
operation by their message key. The message size is 1 KB more or less.
The left topic has around two hundred thousand (200,000) messages per
second, while the right topic has around two thousand (2000) message per
second.
Each topic has 96 partitions with 3 hours retention time.
The join operation has 15 minutes time window.
We have a very concerning problem that the consumer keep getting
rebalancing after some time and the consumer lag can accumulate to 200
millions or more.
We have doing a lot of tuning and test on our service but the problem
still persist. The consumer start rebalancing when we have 100,000 tps
or more.
Below are the latest configuration we use:
- auto.offset.reset = latest
- session.timeout.ms = 300000
- heartbeat.interval.ms = 75000
- fetch.max.wait.ms = 500
- fetch.min.bytes = 1048576
- fetch.max.bytes = 52428800
- max.partition.fetch.bytes = 1048576
- max.poll.interval.ms = 300000
- max.poll.records = 1
- request.timeout.ms = 120000
- default.api.timeout.ms = 60000
- cache.max.bytes.buffering = 104857600
- num.stream.threads = 48 (each pod)
Below are some info regarding kafka cluster we are using:
- 10 brokers (8 core 2.3GHz, 16GB RAM)
- kafka version = 3.8.0
- kafka-streams library = 3.1.2
The service running in a dockerized container with 2 pods (we have tried
up to 4 pods too).
Could someone provide inside or solutions so we can have a stable and
fast consumer to handle this kind of messages?
Thank you!
Martinus Elvin