Hey Colin, Thanks for the review.
Regarding the metrics that reflect times: my initial thinking was to indeed have these be "soft state", which would be reset when a controller failover happens. I'm not sure if it's a big issue if these values get reset though, since a controller failover means brokers in startup would need to register again to the new controller anyways. Since what we're trying to monitor with these metrics is the broker's startup and shutdown statuses from the controller's view, my thinking was that exposing this soft state would be appropriate. There exist metrics that expose other soft state of the controller in `QuorumControllerMetrics.java`, and I thought the proposed metrics here would fit with what exists there. If instead we're updating these metrics based on the metadata log events for registration changes, it looks like `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess we could add a `ControlledShutdownBrokerCount`. For specifically tracking brokers in their initial startup fencing using the log events, I'm not totally sure as of now how we can actually do this from only the information in `BrokerRegistration`. I guess we know a broker is undergoing startup when it's fenced and has an `incarnationId` the controller hasn't seen before in the log? Regarding the per-broker metrics, what are your thoughts about the metric cardinality of this? There was some discussion about having a startup/shutdown time per-broker and I pushed back against it because the number of metrics we expose as a result is the number of brokers in the cluster. Additionally, I don't think the controller can know of a live broker that has not attempted to register yet in order to make a metric for it and assign it a value of 0. Is a value of 0 for brokers that shutdown? In that case, doesn't that make the metric cardinality worse? I think if we decide to go that route we should only have states 1, 2, and 3. Best, Kevin Wu On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: > Hey all, > > I posted a KIP to monitor broker startup and controlled shutdown on the > controller-side. Here's the link: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup > > Best, > Kevin Wu >