Mehari Beyene created KAFKA-18570:
-------------------------------------

             Summary: Add progress metric for log loading during Kafka broker 
startup 
                 Key: KAFKA-18570
                 URL: https://issues.apache.org/jira/browse/KAFKA-18570
             Project: Kafka
          Issue Type: Improvement
    Affects Versions: 4.0.0
            Reporter: Mehari Beyene


When a Kafka broker starts up, it restores its state from the segment files 
stored on disk and from the auxiliary checkpoint files used to store the 
broker's state. 

After a clean shutdown, all state has been persisted to the local disk, so 
restoring the broker's state is relatively quick (estimated at under 10 minutes 
for a partition count of 4000).

However, if the broker experiences an unclean shutdown, log loading also 
involves recovering the broker state by replaying messages to reconstruct the 
last known safe state of the broker. This recovery can take a very long time: 
anecdotally, we have seen recoveries that took more than two hours.

Log recovery is triggered as part of log loading. During recovery, no metric 
indicates progress, leaving both Kafka cluster administrators and customers 
blind to the state of the recovery. Without a metric that operators can use to 
estimate an ETA, planning and managing expectations is difficult.

The exit criterion for this issue is a metric that shows the progress of log 
loading when a broker starts up.
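
As a rough illustration of what such a metric could expose, here is a minimal 
sketch of a progress tracker. It is not Kafka code: the class name 
LogLoadingProgress and all of its methods are hypothetical, and it assumes the 
total number of logs to load is known before loading begins. It reports the 
remaining count, the fraction complete, and a naive ETA based on the average 
time per log so far.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not part of Kafka: tracks how many logs have been
// loaded out of a known total so that gauges could report progress.
class LogLoadingProgress {
    private final long totalLogs;
    private final long startNanos;
    private final AtomicLong loadedLogs = new AtomicLong();

    // totalLogs: number of logs to load (assumed known up front).
    // startNanos: monotonic timestamp taken when loading begins.
    LogLoadingProgress(long totalLogs, long startNanos) {
        this.totalLogs = totalLogs;
        this.startNanos = startNanos;
    }

    // Called by a loader thread each time one log finishes loading.
    void markLogLoaded() {
        loadedLogs.incrementAndGet();
    }

    // Value a "remaining logs" gauge could report.
    long remainingLogs() {
        return totalLogs - loadedLogs.get();
    }

    // Fraction in [0.0, 1.0]; an empty broker counts as fully loaded.
    double fractionComplete() {
        return totalLogs == 0 ? 1.0 : (double) loadedLogs.get() / totalLogs;
    }

    // Naive ETA in milliseconds: average time per loaded log multiplied by
    // the remaining count. Returns -1 until at least one log has loaded.
    long estimatedRemainingMillis(long nowNanos) {
        long done = loadedLogs.get();
        if (done == 0) {
            return -1;
        }
        long perLogNanos = (nowNanos - startNanos) / done;
        return perLogNanos * (totalLogs - done) / 1_000_000L;
    }
}
```

A caller would create the tracker with System.nanoTime() as the start time, 
call markLogLoaded() from each loader thread, and register the getters as 
gauges. Counting logs is a coarse measure, since logs that need recovery take 
far longer than cleanly closed ones; counting segments or bytes would likely 
give a steadier estimate.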



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
