
*Summary* In our Kafka 3.6.1 cluster (KRaft mode, 9 brokers and 3
dedicated controllers), the network threads on a subset of brokers
occasionally saturate (network processor idle percent drops to 0).
Partition leadership and client connections are evenly distributed across
brokers, yet the problem is broker-specific and intermittent.

*Cluster Setup*

   - Kafka 3.6.1 (KRaft).
   - 9 brokers + 3 dedicated controllers.
   - Node size: 24 vCPU, 48 GB RAM.
   - ~35k partitions total (≈3950 per broker).
   - ~80k connections total (≈8.5k per broker).
   - Leaders and connections are evenly distributed.

*Broker Configuration (full)*

group.initial.rebalance.delay.ms=0
log.retention.check.interval.ms=30000
log.retention.hours=24
log.roll.hours=1
log.segment.bytes=1073741824
num.io.threads=24
num.network.threads=24
num.recovery.threads.per.data.dir=2
offsets.topic.replication.factor=3
socket.receive.buffer.bytes=-1
socket.request.max.bytes=10485760
socket.send.buffer.bytes=-1
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
zookeeper.connection.timeout.ms=10000
delete.topic.enable=True
replica.fetch.max.bytes=5242880
max.message.bytes=5242880
message.max.bytes=5242880
default.replication.factor=3
min.insync.replicas=2
num.replica.fetchers=2
replica.fetch.wait.max.ms=500
replica.lag.time.max.ms=30000
controller.quorum.election.timeout.ms=2000
controller.quorum.request.timeout.ms=4000
socket.listen.backlog.size=500
queued.max.requests=1000

*Observed Impact*

   - Only certain brokers experience network processor idle percent = 0
   (see the JMX sketch after this list).
   - Other brokers remain unaffected at the same time.
   - CPU and memory remain low.
   - Consumers experience delayed fetches on impacted brokers.
   - The issue is intermittent, sometimes appearing in the morning.
   - Connection counts are roughly equal across brokers.
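
For concreteness, a minimal JMX polling sketch for the two metrics behind
the observations above. The broker host and JMX port (broker-1:9999) are
placeholders; the MBean names are the standard kafka.network metrics for
network-processor idle time and request queue depth.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class BrokerNetIdleProbe {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port: point at the JMX endpoint of an affected broker.
            String url = "service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi";
            try (JMXConnector jmxc = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                // Fraction of time the network threads are idle (0.0 = saturated).
                ObjectName idle = new ObjectName(
                    "kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent");
                // Requests waiting for I/O threads (bounded by queued.max.requests).
                ObjectName queue = new ObjectName(
                    "kafka.network:type=RequestChannel,name=RequestQueueSize");
                System.out.println("idle=" + conn.getAttribute(idle, "Value")
                        + " queued=" + conn.getAttribute(queue, "Value"));
            }
        }
    }

Polling both during an incident separates two cases: a request queue pinned
at queued.max.requests points to backpressure from the I/O threads, while
idle percent at 0 with a near-empty queue points to the network threads
themselves (e.g. connection churn or many tiny requests).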

*Questions*

   1. What could cause uneven saturation of network threads if both
   partition leadership and client connections are evenly balanced?
   2. Could client behavior (e.g. many small fetch requests, frequent
   reconnects, or misconfigured apps/connectors) overload a subset of brokers
   regardless of connection count?
   3. Are there recommended tuning parameters beyond num.network.threads
   and queued.max.requests (e.g. max.connections, max.connections.per.ip,
   quotas) to protect brokers from such spikes? (An example of the kind of
   quota meant here follows the list.)
   4. What diagnostic steps do you recommend to pinpoint specific
   producers/consumers causing this load pattern?
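
Regarding the quotas mentioned in question 3: a minimal sketch, assuming
the AdminClient quota API (Admin.alterClientQuotas, available since Kafka
2.6), that sets a default request-time quota so no single client can
monopolize network and request-handler threads. The bootstrap address and
the 200.0 value are placeholders, not recommendations.

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.common.quota.ClientQuotaAlteration;
    import org.apache.kafka.common.quota.ClientQuotaEntity;

    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;

    public class DefaultRequestQuota {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder bootstrap address.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
            try (Admin admin = Admin.create(props)) {
                // A null client-id value targets the default entity (all client ids).
                ClientQuotaEntity entity = new ClientQuotaEntity(
                        Collections.singletonMap(ClientQuotaEntity.CLIENT_ID, null));
                // Cap each client at 200% of one thread's time per quota window.
                ClientQuotaAlteration alteration = new ClientQuotaAlteration(
                        entity,
                        List.of(new ClientQuotaAlteration.Op("request_percentage", 200.0)));
                admin.alterClientQuotas(List.of(alteration)).all().get();
            }
        }
    }

Because request_percentage throttles CPU-time share rather than bytes, it
targets exactly the many-small-requests pattern raised in question 2.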
