On Tue, Feb 25, 2025, at 14:12, Kevin Wu wrote: > Hey Colin, > > Thanks for the review. > > Regarding the metrics that reflect times: my initial thinking was to indeed > have these be "soft state", which would be reset when a controller failover > happens. I'm not sure if it's a big issue if these values get reset > though, since a controller failover means brokers in startup would need to > register again to the new controller anyways. Since what we're trying to > monitor with these metrics is the broker's startup and shutdown statuses > from the controller's view, my thinking was that exposing this soft state > would be appropriate.
I guess my concern is that the time-based metrics would reset to 0 on every failover (if I understand the proposed implementation correctly). That seems likely to create confusion. > > There exist metrics that expose other soft state of the controller in > `QuorumControllerMetrics.java`, and I thought the proposed metrics here > would fit with what exists there. If instead we're updating these metrics > based on the metadata log events for registration changes, it looks like > `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess > we could add a `ControlledShutdownBrokerCount`. Yes, having a ControlledShutdownBrokerCount seems completely appropriate. This is now hard state now as well, that can be read directly out of the metadata log. (Or at least it is for MV > 3.3-IV3, which is comparatively ancient now! We could just not support this metric when in very very old metadata versions.) > For specifically tracking > brokers in their initial startup fencing using the log events, I'm not > totally sure as of now how we can actually do this from only the > information in `BrokerRegistration`. I guess we know a broker is undergoing > startup when it's fenced and has an `incarnationId` the controller hasn't > seen before in the log? I think you would have to add another boolean to the RegisterBrokerRecord in order to be able to detect brokers that were starting up. It might be worth it to do this, though. > > Regarding the per-broker metrics, what are your thoughts about the metric > cardinality of this? There was some discussion about having a > startup/shutdown time per-broker and I pushed back against it because the > number of metrics we expose as a result is the number of brokers in the > cluster. I think it would be useful to have a state for each broker exposed as a metric. I can think of a lot of scenarios where this would be useful to have. I don't think we should have more than one metric per broker though, if we can help it. > Additionally, I don't think the controller can know of a live > broker that has not attempted to register yet in order to make a metric for > it and assign it a value of 0. Is a value of 0 for brokers that shutdown? > In that case, doesn't that make the metric cardinality worse? I think if we > decide to go that route we should only have states 1, 2, and 3. Unless the broker has been unregistered, we still know about it even if it's fenced and not heartbeating. We don't know about it if we've never seen it before (like the first time every node in the cluster starts up, ever), but that's a pretty rare scenario. best, Colin > > Best, > Kevin Wu > > On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: > >> Hey all, >> >> I posted a KIP to monitor broker startup and controlled shutdown on the >> controller-side. Here's the link: >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup >> >> Best, >> Kevin Wu >>