> > That's an interesting idea. However, I think that's going to be messy and > difficult for people to use. For example, how would you set up Grafana or > Datadog to use this? The string could also get extremely long (imagine 1000 > brokers all in startup.)
Hmm... Yeah from what I've read so far setting this up might be kind of challenging. I'm not seeing that OTEL supports gauges for string values. I'm still a little confused as to why having a per-broker metric to expose its state is preferred, but I think this is at least part of the reason? When drafting this KIP, I was only really considering the scenarios of the broker's initial metadata load during startup and their controlled shutdown, which my proposed metrics would cover. However, there are a lot of other scenarios with fenced brokers which have already started up that the existing fencedBrokers metric doesn't really give enough information about from the controller-side, since it just reports the number. For these scenarios, I don't think my proposed startup/shutdown focused metrics would be very useful. I'm on board with the proposed per-broker metric that exposes its state. I think it would be helpful to enumerate some specific cases though for the KIP. On Thu, Feb 27, 2025 at 2:19 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: > I guess my concern is that the time-based metrics would reset to 0 on >> every failover (if I understand the proposed implementation correctly). >> That seems likely to create confusion. > > Yeah that makes sense to me. I'm fine with moving towards the approach of > either (since I don't think we need both): > > - Exposing the number of brokers in 1. startup, 2. fenced (what we > have now), and 3. in controlled shutdown > - Exposing a per-broker metric reflecting the state of the broker > (more on this below). > > I think it would be useful to have a state for each broker exposed as a >> metric. I can think of a lot of scenarios where this would be useful to >> have. I don't think we should have more than one metric per broker though, >> if we can help it. > > Instead of having exactly a per-broker metric which exposes a number that > maps to a state (0, 1, 2, and 3), what if we expose 4 metrics whose values > are a comma-delimited string of the brokers in those states. > Something along the lines of: > > - Metric: name = BrokersNotRegistered, value = "kafka-1" > - Metric: name = BrokersRegisteredAndNeverUnfenced, value = "kafka-2" > - Metric: name = BrokersRegisteredAndFenced, value = "kafka-2,kafka-3" > - Metric: name = BrokersRegisteredRegisteredAndUnfenced, value = > "kafka-4,kafka-5" > > I guess there will be overlap between the second and third metrics, but > there do exist metrics that expose `Gauge<String>` values. > > On Tue, Feb 25, 2025 at 4:12 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: > >> Hey Colin, >> >> Thanks for the review. >> >> Regarding the metrics that reflect times: my initial thinking was to >> indeed have these be "soft state", which would be reset when a controller >> failover happens. I'm not sure if it's a big issue if these values get >> reset though, since a controller failover means brokers in startup would >> need to register again to the new controller anyways. Since what we're >> trying to monitor with these metrics is the broker's startup and shutdown >> statuses from the controller's view, my thinking was that exposing this >> soft state would be appropriate. >> >> There exist metrics that expose other soft state of the controller in >> `QuorumControllerMetrics.java`, and I thought the proposed metrics here >> would fit with what exists there. If instead we're updating these metrics >> based on the metadata log events for registration changes, it looks like >> `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess >> we could add a `ControlledShutdownBrokerCount`. For specifically tracking >> brokers in their initial startup fencing using the log events, I'm not >> totally sure as of now how we can actually do this from only the >> information in `BrokerRegistration`. I guess we know a broker is undergoing >> startup when it's fenced and has an `incarnationId` the controller hasn't >> seen before in the log? >> >> Regarding the per-broker metrics, what are your thoughts about the metric >> cardinality of this? There was some discussion about having a >> startup/shutdown time per-broker and I pushed back against it because the >> number of metrics we expose as a result is the number of brokers in the >> cluster. Additionally, I don't think the controller can know of a live >> broker that has not attempted to register yet in order to make a metric for >> it and assign it a value of 0. Is a value of 0 for brokers that shutdown? >> In that case, doesn't that make the metric cardinality worse? I think if we >> decide to go that route we should only have states 1, 2, and 3. >> >> Best, >> Kevin Wu >> >> On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: >> >>> Hey all, >>> >>> I posted a KIP to monitor broker startup and controlled shutdown on the >>> controller-side. Here's the link: >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup >>> >>> Best, >>> Kevin Wu >>> >>