Hi all.

We've run into strange broker behaviour in our cluster and have run out of
ideas about what could be causing it. One of the brokers keeps entering a
state of ~90% CPU usage and 5-30% CPU throttling without any visible
reason. We have a 5-broker cluster with 1 KRaft controller running the
latest 3.3.1 release (Bitnami Docker images in K8s).

All the brokers have 1000-1500 topics with 1 partition each and replication
factor 1. The K8s pods are configured with a 10 GiB heap and 6 CPUs.
Thread states barely change (runnable/blocked/etc.); only CPU usage
increases. Network idle is around 76%. We have 10 producers (6 working
mostly at a constant rate, the rest driven by periodic tasks, so their load
is not evenly distributed) and 20 consumers (4 of them doing most of the
consuming). Data load distribution per 10 minutes: broker IN bytes 19%
(250 MiB), broker OUT bytes 81% (1.6 GiB); 44% of topics (~900) actively
receive publishes and 66% (~1200) are actively fetched by consumers.
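For reference, every topic is created with a single partition and
replication factor 1, roughly like the sketch below (the topic name and
bootstrap address are placeholders, not our real values):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class TopicSetupSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
            try (Admin admin = Admin.create(props)) {
                // every topic: 1 partition, replication factor 1
                NewTopic topic = new NewTopic("example-topic", 1, (short) 1);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }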

The only correlation we see so far is that after shutting down 3 of the 6
most active publishers, CPU usage on the problem broker returns to normal.
However, the load from those publishers is roughly the same across all
brokers and causes no issues on the others. Producers use linger.ms of
50 ms; consumers use fetch.min.bytes of 10 KiB with fetch.max.bytes at the
default (50 MiB). One of the applications whose shutdown brings the broker
back to normal acts as both a publisher and a consumer, and its poll
interval steadily increases up to 30 s once it starts consuming full
fetch.max.bytes (50 MiB) batches. Purgatory is used only for fetch requests
on all nodes.
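To make the client settings concrete, they look roughly like this (the
bootstrap address, group id and serializers are placeholders; only
linger.ms, fetch.min.bytes and the default fetch.max.bytes reflect our
actual values):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;
    import org.apache.kafka.common.serialization.ByteArraySerializer;

    public class ClientConfigSketch {
        static Properties producerProps() {
            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // placeholder
            p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
            p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
            p.put(ProducerConfig.LINGER_MS_CONFIG, "50");                   // 50 ms linger on all producers
            return p;
        }

        static Properties consumerProps() {
            Properties c = new Properties();
            c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // placeholder
            c.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");         // placeholder
            c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            c.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "10240");          // fetch.min.bytes = 10 KiB
            // fetch.max.bytes is left at the default 52428800 (50 MiB)
            return c;
        }
    }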

I'm not sure what else would be helpful for understanding the issue. I'd
greatly appreciate any thoughts on where to look for the cause of this
isolated CPU usage increase.
