On Tue, Feb 25, 2025, at 14:40, Colin McCabe wrote: > On Tue, Feb 25, 2025, at 14:12, Kevin Wu wrote: >> Hey Colin, >> >> Thanks for the review. >> >> Regarding the metrics that reflect times: my initial thinking was to indeed >> have these be "soft state", which would be reset when a controller failover >> happens. I'm not sure if it's a big issue if these values get reset >> though, since a controller failover means brokers in startup would need to >> register again to the new controller anyways. Since what we're trying to >> monitor with these metrics is the broker's startup and shutdown statuses >> from the controller's view, my thinking was that exposing this soft state >> would be appropriate. > > I guess my concern is that the time-based metrics would reset to 0 on > every failover (if I understand the proposed implementation correctly). > That seems likely to create confusion. > >> >> There exist metrics that expose other soft state of the controller in >> `QuorumControllerMetrics.java`, and I thought the proposed metrics here >> would fit with what exists there. If instead we're updating these metrics >> based on the metadata log events for registration changes, it looks like >> `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess >> we could add a `ControlledShutdownBrokerCount`. > > Yes, having a ControlledShutdownBrokerCount seems completely > appropriate. This is now hard state now as well, that can be read > directly out of the metadata log. > > (Or at least it is for MV > 3.3-IV3, which is comparatively ancient > now! We could just not support this metric when in very very old > metadata versions.)
I meant for MV >= 3.3-IV3 :) > >> For specifically tracking >> brokers in their initial startup fencing using the log events, I'm not >> totally sure as of now how we can actually do this from only the >> information in `BrokerRegistration`. I guess we know a broker is undergoing >> startup when it's fenced and has an `incarnationId` the controller hasn't >> seen before in the log? > > I think you would have to add another boolean to the > RegisterBrokerRecord in order to be able to detect brokers that were > starting up. It might be worth it to do this, though. > >> >> Regarding the per-broker metrics, what are your thoughts about the metric >> cardinality of this? There was some discussion about having a >> startup/shutdown time per-broker and I pushed back against it because the >> number of metrics we expose as a result is the number of brokers in the >> cluster. > > I think it would be useful to have a state for each broker exposed as a > metric. I can think of a lot of scenarios where this would be useful to > have. I don't think we should have more than one metric per broker > though, if we can help it. > >> Additionally, I don't think the controller can know of a live >> broker that has not attempted to register yet in order to make a metric for >> it and assign it a value of 0. Is a value of 0 for brokers that shutdown? >> In that case, doesn't that make the metric cardinality worse? I think if we >> decide to go that route we should only have states 1, 2, and 3. > > Unless the broker has been unregistered, we still know about it even if > it's fenced and not heartbeating. > > We don't know about it if we've never seen it before (like the first > time every node in the cluster starts up, ever), but that's a pretty > rare scenario. > > best, > Colin > > >> >> Best, >> Kevin Wu >> >> On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: >> >>> Hey all, >>> >>> I posted a KIP to monitor broker startup and controlled shutdown on the >>> controller-side. Here's the link: >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup >>> >>> Best, >>> Kevin Wu >>>