Mehari Beyene created KAFKA-18570:
-------------------------------------

             Summary: Add progress metric for log loading during Kafka broker startup
                 Key: KAFKA-18570
                 URL: https://issues.apache.org/jira/browse/KAFKA-18570
             Project: Kafka
          Issue Type: Improvement
    Affects Versions: 4.0.0
            Reporter: Mehari Beyene

When a Kafka broker starts up, it restores its state from the segment files on disk and from the auxiliary checkpoint files that record the broker's state. After a clean shutdown, all state has already been persisted to the local disk, so restoring it is relatively quick (estimated at under 10 minutes for a partition count of 4000).

After an unclean shutdown, however, log loading also involves recovery: the broker replays messages to reconstruct its last known safe state. This recovery can take a very long time; anecdotally, we have seen it take more than two hours.

Log recovery is triggered as part of log loading. While it runs, no metric indicates its progress, which leaves both Kafka cluster administrators and customers blind to the state of the recovery. Without a metric from which to estimate an ETA, operators find it difficult to plan and to manage expectations.

The exit criterion for this issue is to add a metric that shows the progress of log loading when a broker starts up.
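
To make the request concrete, here is a minimal sketch of the kind of counter such a metric could be backed by. It is not taken from the Kafka code base: the class name `LogLoadProgress`, its methods, and the choice of "one log directory entry" as the unit of work are all assumptions made for illustration. The ETA is a naive linear extrapolation, which will be rough because recovery time varies from log to log.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical progress tracker for log loading (illustrative only,
 * not part of Kafka). One "log" is assumed to be the unit of work.
 */
final class LogLoadProgress {
    private final long totalLogs;
    private final long startNanos;
    // Loading may run on several threads, so the counter is atomic.
    private final AtomicLong loadedLogs = new AtomicLong();

    LogLoadProgress(long totalLogs, long startNanos) {
        this.totalLogs = totalLogs;
        this.startNanos = startNanos;
    }

    /** Call once each time a log has finished loading or recovering. */
    void markLogLoaded() {
        loadedLogs.incrementAndGet();
    }

    /** Number of logs still to load, never negative. */
    long remainingLogs() {
        return Math.max(0L, totalLogs - loadedLogs.get());
    }

    /** Progress as a percentage in [0, 100]; 100 when there is nothing to load. */
    double percentComplete() {
        if (totalLogs <= 0) {
            return 100.0;
        }
        return Math.min(100.0, 100.0 * loadedLogs.get() / totalLogs);
    }

    /**
     * Naive linear estimate of the remaining time in nanoseconds,
     * or -1 if no log has finished yet and no estimate is possible.
     */
    long estimatedRemainingNanos(long nowNanos) {
        long loaded = loadedLogs.get();
        if (loaded == 0) {
            return -1L;
        }
        double nanosPerLog = (double) (nowNanos - startNanos) / loaded;
        return (long) (nanosPerLog * remainingLogs());
    }
}
```

A broker could create one such tracker with `System.nanoTime()` as the start time, call `markLogLoaded()` from the loading threads, and publish `remainingLogs()` or `percentComplete()` as a gauge through whatever metrics mechanism is chosen for this issue.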