Hello,

We are observing Kafka brokers occasionally taking a very long time to load
logs on startup compared to usual (40 minutes versus 30-60 seconds). This
happens when restarting a broker as part of a rolling restart, following the
procedure prescribed by Confluent here:
https://docs.confluent.io/current/kafka/post-deployment.html#rolling-restart
Controlled shutdown is enabled on the brokers, and the shutdown was reported
as successful.
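
For context, here is roughly how the controlled shutdown can be confirmed from
the broker log. This is only a sketch: it assumes the broker writes a
"Controlled shutdown succeeded" line to server.log on a clean shutdown, and
the log path is just an example.

    #!/usr/bin/env python3
    """Check whether the last shutdown recorded in a broker log was a
    successful controlled shutdown."""
    import sys

    # Path is illustrative; pass the real server.log location as an argument.
    LOG_FILE = sys.argv[1] if len(sys.argv) > 1 else "/var/log/kafka/server.log"

    last_match = None
    with open(LOG_FILE, errors="replace") as f:
        for line in f:
            # Keep the most recent occurrence so an older clean shutdown does
            # not give a false positive.
            if "Controlled shutdown succeeded" in line:
                last_match = line.strip()

    print(last_match or "no controlled-shutdown message found in " + LOG_FILE)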

We use Confluent Platform 5.5.0 (Kafka 2.5.0), with 3 brokers, a replication
factor of 3, and a minimum of 2 in-sync replicas. Each broker uses a 1 TB AWS
EBS volume for log storage.

Some observations:

  *   It is not always the same broker that takes a long time to start up.
  *   It does not occur only for the broker that is the active controller.
  *   A topic partition that normally loads quickly (15 ms) can take a long
time (9549 ms) on the same broker only a day later (see the sketch after
this list).
  *   We experienced this on Kafka 2.4.0 as well, though it did not occur for
a few weeks after upgrading to 2.5.0.
  *   EBS throughput during slow startups is no lower than during fast ones.
  *   The logs do not show any errors on startup when this happens.
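
The per-partition load times above can be read from the broker startup log.
A minimal sketch of extracting and ranking them, assuming the per-partition
"Completed load of log ... in N ms" messages that Kafka 2.x brokers write to
server.log (the exact wording and the log path may differ per setup):

    #!/usr/bin/env python3
    """List the slowest partition loads reported in a Kafka broker's server.log."""
    import re
    import sys

    # Matches lines such as:
    #   [Log partition=my-topic-3, dir=/var/kafka-logs] Completed load of log
    #   with 42 segments, ... in 9549 ms
    PATTERN = re.compile(
        r"partition=(?P<partition>\S+?),.*Completed load of log.*in (?P<ms>\d+) ms")

    def load_times(path):
        """Return (partition, milliseconds) tuples found in the log file."""
        times = []
        with open(path, errors="replace") as f:
            for line in f:
                m = PATTERN.search(line)
                if m:
                    times.append((m.group("partition"), int(m.group("ms"))))
        return times

    if __name__ == "__main__":
        # Log path is illustrative; pass the real server.log as an argument.
        log_file = sys.argv[1] if len(sys.argv) > 1 else "/var/log/kafka/server.log"
        for partition, ms in sorted(load_times(log_file), key=lambda t: t[1],
                                    reverse=True)[:20]:
            print(f"{ms:>8} ms  {partition}")

Comparing this output between a fast and a slow startup of the same broker
makes it easy to spot partitions like the 15 ms versus 9549 ms case above.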

Does anyone have an idea of what could be causing this, or any suggestions
for further tests that would help narrow down the cause?

Thank you in advance and have a good day!

-- Pieter Hameete
