When you said "The only difference we could see is that thread usage
decreases during these periods", did you mean that thread usage increases?
You can monitor the usage of two different thread pools: the network
thread pool and the request handler thread pool. If neither of them is
busy and yet you see a large CPU spike, it is probably caused by
background threads responsible for log cleaning/compaction, or by some
other JVM activity on your broker. If the network pool is busy, the
spike is probably due to a large number of requests. If the request
handler pool is busy, then a small number of requests is causing heavy
CPU work during processing. One way that can happen is if you send data
in message format V1 while the server expects V2: CPU is then spent on
the server converting records from V1 to V2. This becomes worse when the
data is compressed, because the server has to decompress and recompress
it around the conversion before actually writing to disk. Another reason
could be very aggressive metrics scraping, for example if you are using
something like Prometheus.
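The brokers expose idle-percent metrics for both pools over JMX
(NetworkProcessorAvgIdlePercent for the network pool,
RequestHandlerAvgIdlePercent for the request handlers), which you are
probably already scraping if you run Prometheus. A rough sketch of the
decision logic above — note the 0.3 "busy" threshold is my own
illustrative assumption, not a Kafka default:

```python
# Sketch of the triage logic, assuming you scrape these broker JMX metrics:
#   kafka.network:type=SocketServer,name=NetworkProcessorAvgIdlePercent
#   kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent
# The busy_threshold of 0.3 is an illustrative assumption, not a Kafka default.

def classify_cpu_spike(network_idle: float, handler_idle: float,
                       busy_threshold: float = 0.3) -> str:
    """Map pool idle ratios (0.0 = saturated, 1.0 = idle) to a likely cause."""
    network_busy = network_idle < busy_threshold
    handler_busy = handler_idle < busy_threshold
    if not network_busy and not handler_busy:
        # Neither pool is saturated: suspect log cleaner/compaction or
        # other background JVM activity.
        return "background-threads"
    if handler_busy:
        # A few expensive requests, e.g. V1->V2 message format conversion
        # with decompression/recompression.
        return "expensive-requests"
    # Network pool saturated: a large volume of requests/connections.
    return "high-request-rate"
```

For example, a busy handler pool alongside an idle network pool
(classify_cpu_spike(0.9, 0.1)) points at expensive requests rather than
request volume.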

For a more deterministic picture of where your CPU is being spent, I
would suggest attaching a profiler and taking a snapshot. I recently
gave a talk at Kafka Summit on how you can use flamegraphs obtained via
a profiler to find hotspots in your code. See:
https://www.confluent.io/events/kafka-summit-london-2023/unveiling-the-inner-workings-of-apache-kafka-r-with-flamegraphs/
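Profilers such as async-profiler can emit collapsed-stack output (one
"frame;frame;frame count" line per unique stack), which is the format
flamegraph tooling consumes. As a small sketch of mining such a snapshot
for hotspots — the frame names below are made up purely for
illustration:

```python
from collections import Counter

def top_hotspots(collapsed_lines, n=3):
    """Aggregate per-frame inclusive sample counts from collapsed-stack
    profiler output ('frame;frame;frame count' lines)."""
    totals = Counter()
    for line in collapsed_lines:
        stack, _, count = line.rpartition(" ")
        for frame in set(stack.split(";")):  # count each frame once per stack
            totals[frame] += int(count)
    return totals.most_common(n)

# Hypothetical snapshot: most samples pass through the compression path.
sample = [
    "kafka.Produce;LogValidator.convert;GZIP.decompress 70",
    "kafka.Produce;LogValidator.convert;GZIP.compress 20",
    "kafka.Fetch;Log.read 10",
]
```

In this made-up snapshot, top_hotspots(sample) would surface the produce
path and the (de)compression frames as the dominant cost, which is the
kind of signal the flamegraph makes visible at a glance.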

--
Divij Vaidya

On Wed, Aug 9, 2023 at 4:25 PM sunil chaudhari
<sunilmchaudhar...@gmail.com> wrote:
>
> Point 2 may have an impact if the size of the partitions is too big;
> too many log segments will cause that many IOPS.
> I am not an expert though.
>
> On Wed, 9 Aug 2023 at 6:43 PM, Tiansu Yu <tiansu...@klarna.com.invalid>
> wrote:
>
> > 1. We use cruise-control to actively balance the partitions across all
> > brokers. So point 1 could be ruled out.
> > 2. I am not sure how much this would impact the broker, as we do have some
> > exceptionally large partitions around. I have to check whether they live
> > on the aforementioned broker. So far I don't see a strong correlation
> > between total producer/consumer byte rates and CPU spikes on this
> > broker.
> >
> > Tiansu Yu
> > Engineer
> > Data Ingestion & Streaming
> >
> > Klarna Bank AB German Branch
> > Chausseestraße 117
> > 10115 Berlin
> > Tel: +49 221 669 501 00
> > klarna.de
> >
> > Klarna Bank AB, German Branch
> > Sitz: Berlin, Amtsgericht Charlottenburg HRB 217291 B
> > USt-Nr.: DE 815 867 324
> > Zweigstelle der Klarna Bank AB (publ), AG schwedischen Rechts mit
> > Hauptsitz in Stockholm,
> > Schw. Gesellschaftsregister 556737-0431
> > Verwaltungsratsvorsitzender: Michael Moritz
> > Geschäftsführender Direktor: Sebastian Siemiatkowski
> > Leiter Zweigniederlassung: Yaron Shaer, Björn Petersen
> >
> > On 9. Aug 2023, at 12:05, sunil chaudhari <sunilmchaudhar...@gmail.com>
> > wrote:
> >
> > Hi, I can guess at two problems here:
> > 1. Too many partitions concentrated on this broker compared to the
> > other brokers
> > 2. The partitions on this broker might be larger than the partitions
> > on the other brokers
> >
> > Please check whether all brokers are evenly balanced in terms of the
> > number of partitions and the total topic size on each broker.
> >
> > On Wed, 9 Aug 2023 at 1:29 PM, Tiansu Yu <tiansu...@klarna.com.invalid>
> > wrote:
> >
> > Hi Kafka community,
> >
> > We have an issue with our Kafka cluster from time to time, where a single
> > (one and only one) broker (leader) in the cluster reaches 100% CPU
> > utilisation. We could not see any apparent issue in the metrics. There is
> > no heap memory usage increase, no excessive connections made on the broker,
> > no misbehaving producers or consumers trying to dump or load excessively
> > during these periods. The only difference we could see is that thread usage
> > decreases during these periods. Despite the problem, the service is still
> > available (understandable from Kafka's perspective).
> >
> > We are trying to understand what else might be the cause of the issue and
> > how we can mitigate it.
> >
> > Tiansu Yu
> > Engineer
> > Data Ingestion & Streaming
> >
> >
> >
