*Summary*
In our Kafka 3.6.1 cluster (KRaft mode with 9 brokers and 3
dedicated controllers), network threads occasionally saturate (idle percent
→ 0) on only a subset of brokers. Partition leadership and the number of
client connections are evenly distributed across brokers. Still, the
problem is broker-specific and intermittent.
*Cluster Setup*
- Kafka 3.6.1 (KRaft).
- 9 brokers + 3 dedicated controllers.
- Node size: 24 vCPU, 48 GB RAM.
- ~35k partitions total (≈3950 per broker).
- ~80k connections total (≈8.5k per broker).
- Leaders and connections are evenly distributed.
*Broker Configuration (full)*
group.initial.rebalance.delay.ms=0
log.retention.check.interval.ms=30000
log.retention.hours=24
log.roll.hours=1
log.segment.bytes=1073741824
num.io.threads=24
num.network.threads=24
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=3
socket.receive.buffer.bytes=-1
socket.request.max.bytes=10485760
socket.send.buffer.bytes=-1
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
zookeeper.connection.timeout.ms=10000
delete.topic.enable=True
replica.fetch.max.bytes=5242880
max.message.bytes=5242880
message.max.bytes=5242880
default.replication.factor=3
min.insync.replicas=2
num.replica.fetchers=2
replica.fetch.wait.max.ms=500
replica.lag.time.max.ms=30000
controller.quorum.election.timeout.ms=2000
controller.quorum.request.timeout.ms=4000
socket.listen.backlog.size=500
queued.max.requests=1000
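To put the configuration in context, here is a minimal Python sketch that parses `key=value` broker properties and works out the average number of connections per network thread. The function names `parse_properties` and `connections_per_network_thread` are hypothetical helpers, not part of Kafka. The 8,500 figure is the approximate per-broker connection count quoted above.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' at all
            props[key.strip()] = value.strip()
    return props


def connections_per_network_thread(connections_per_broker, props):
    """Average connections handled by each network thread on one broker."""
    threads = int(props["num.network.threads"])
    return connections_per_broker / threads


# Two of the settings listed above, used as sample input.
sample = "num.network.threads=24\nqueued.max.requests=1000\n"
props = parse_properties(sample)
per_thread = connections_per_network_thread(8500, props)  # roughly 354
```

This is only an average. It says nothing about how many requests each connection sends, so evenly spread connection counts can still hide very uneven request load.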
*Observed Impact*
- Only certain brokers see network processor idle percent drop to 0.
- Other brokers remain unaffected at the same time.
- CPU and memory remain low.
- Consumers experience delayed fetches on impacted brokers.
- The issue is intermittent, sometimes appearing in the morning.
- Connection counts are roughly equal across brokers.
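The pattern above (a few brokers at zero idle while the rest stay healthy) could be detected with something like the following Python sketch. It assumes idle values are ratios between 0 and 1, as reported by the `NetworkProcessorAvgIdlePercent` metric, and that they have already been collected per broker by some external means. The function name `saturated_brokers`, the 0.05 threshold, and the sample values are illustrative assumptions, not measurements from this cluster.

```python
from statistics import mean


def saturated_brokers(idle_samples, threshold=0.05):
    """Return sorted broker ids whose average idle ratio is below threshold.

    idle_samples maps broker id -> list of idle ratios in [0, 1].
    """
    flagged = []
    for broker_id, samples in idle_samples.items():
        if samples and mean(samples) < threshold:
            flagged.append(broker_id)
    return sorted(flagged)


# Hypothetical samples: broker 4 is pinned near zero, broker 7 dips only once.
example = {
    1: [0.62, 0.58, 0.60],
    4: [0.02, 0.00, 0.01],
    7: [0.55, 0.03, 0.57],
}
flagged = saturated_brokers(example)  # only broker 4 is flagged
```

Averaging over a window rather than alerting on a single sample helps separate sustained saturation from momentary dips.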
*Questions*
1. What could cause uneven saturation of network threads if both
partition leadership and client connections are evenly balanced?
2. Could client behavior (e.g. many small fetch requests, frequent
reconnects, or misconfigured apps/connectors) overload a subset of brokers
regardless of connection count?
3. Are there recommended tuning parameters beyond num.network.threads
and queued.max.requests (e.g. max.connections, max.connections.per.ip,
quotas) to protect brokers from such spikes?
4. What diagnostic steps do you recommend to pinpoint specific
producers/consumers causing this load pattern?
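For question 4, one possible approach is to rank client ids by request volume on each affected broker. The Python sketch below assumes per-client request counts are already available as `(broker_id, client_id, request_count)` tuples, for example extracted from broker request logs or per-client metrics. That input format, the function name `top_clients_per_broker`, and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def top_clients_per_broker(records, n=3):
    """Sum request counts per (broker, client) and return the top n per broker."""
    counts = defaultdict(Counter)
    for broker_id, client_id, request_count in records:
        counts[broker_id][client_id] += request_count
    return {broker: c.most_common(n) for broker, c in counts.items()}


# Hypothetical records: one connector dominates requests on broker 4.
records = [
    (4, "connector-a", 900),
    (4, "app-b", 100),
    (4, "connector-a", 600),
    (1, "app-b", 120),
]
top = top_clients_per_broker(records, n=1)
```

Comparing the top clients on a saturated broker with those on a healthy one at the same time would show whether a small set of clients accounts for the difference.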