Thanks for the comments Jose. For 1 and 2, I've changed the naming of the metrics to follow your suggestion of tags/attributes. For 3, I made a note as to why we need the maximum. Basically, it's because the map that contains broker contact times we're using as the source for these metrics removes entries when a broker is fenced. Therefore, we need some default value when the entry doesn't exist for the broker, but it is still registered.
Thanks, Kevin > Thanks for the improvement Kevin. I got a chance to look at the KIP. > > 1. kafka.controller:type=KafkaController,name=BrokerRegistrationState.kafka-X > > Can we use tags or attributes instead of different names? For example, > how about kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X > where X is the node id? > > 2. kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X > > Same here, did you consider using tags or attributes for the node id? > > 3. For the metrics > kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X, > you mentioned that you will limit the value to the heartbeat timeout. > Why? Wouldn't it be a useful report the entire time since the last > heartbeat? That is more information instead of just reporting the > value up to the heartbeat timeout. > > Thanks, > -- > -José > On Thu, Mar 6, 2025 at 1:58 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: > That's an interesting idea. However, I think that's going to be messy and >> difficult for people to use. For example, how would you set up Grafana or >> Datadog to use this? The string could also get extremely long (imagine 1000 >> brokers all in startup.) > > Hmm... Yeah from what I've read so far setting this up might be kind of > challenging. I'm not seeing that OTEL supports gauges for string values. > > I'm still a little confused as to why having a per-broker metric to expose > its state is preferred, but I think this is at least part of the reason? > When drafting this KIP, I was only really considering the scenarios of the > broker's initial metadata load during startup and their controlled > shutdown, which my proposed metrics would cover. However, there are a lot > of other scenarios with fenced brokers which have already started up that > the existing fencedBrokers metric doesn't really give enough information > about from the controller-side, since it just reports the number. For these > scenarios, I don't think my proposed startup/shutdown focused metrics would > be very useful. > I'm on board with the proposed per-broker metric that exposes its state. I > think it would be helpful to enumerate some specific cases though for the > KIP. > > On Thu, Feb 27, 2025 at 2:19 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: > >> I guess my concern is that the time-based metrics would reset to 0 on >>> every failover (if I understand the proposed implementation correctly). >>> That seems likely to create confusion. >> >> Yeah that makes sense to me. I'm fine with moving towards the approach of >> either (since I don't think we need both): >> >> - Exposing the number of brokers in 1. startup, 2. fenced (what we >> have now), and 3. in controlled shutdown >> - Exposing a per-broker metric reflecting the state of the broker >> (more on this below). >> >> I think it would be useful to have a state for each broker exposed as a >>> metric. I can think of a lot of scenarios where this would be useful to >>> have. I don't think we should have more than one metric per broker though, >>> if we can help it. >> >> Instead of having exactly a per-broker metric which exposes a number that >> maps to a state (0, 1, 2, and 3), what if we expose 4 metrics whose values >> are a comma-delimited string of the brokers in those states. >> Something along the lines of: >> >> - Metric: name = BrokersNotRegistered, value = "kafka-1" >> - Metric: name = BrokersRegisteredAndNeverUnfenced, value = "kafka-2" >> - Metric: name = BrokersRegisteredAndFenced, value = "kafka-2,kafka-3" >> - Metric: name = BrokersRegisteredRegisteredAndUnfenced, value = >> "kafka-4,kafka-5" >> >> I guess there will be overlap between the second and third metrics, but >> there do exist metrics that expose `Gauge<String>` values. >> >> On Tue, Feb 25, 2025 at 4:12 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: >> >>> Hey Colin, >>> >>> Thanks for the review. >>> >>> Regarding the metrics that reflect times: my initial thinking was to >>> indeed have these be "soft state", which would be reset when a controller >>> failover happens. I'm not sure if it's a big issue if these values get >>> reset though, since a controller failover means brokers in startup would >>> need to register again to the new controller anyways. Since what we're >>> trying to monitor with these metrics is the broker's startup and shutdown >>> statuses from the controller's view, my thinking was that exposing this >>> soft state would be appropriate. >>> >>> There exist metrics that expose other soft state of the controller in >>> `QuorumControllerMetrics.java`, and I thought the proposed metrics here >>> would fit with what exists there. If instead we're updating these metrics >>> based on the metadata log events for registration changes, it looks like >>> `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess >>> we could add a `ControlledShutdownBrokerCount`. For specifically tracking >>> brokers in their initial startup fencing using the log events, I'm not >>> totally sure as of now how we can actually do this from only the >>> information in `BrokerRegistration`. I guess we know a broker is undergoing >>> startup when it's fenced and has an `incarnationId` the controller hasn't >>> seen before in the log? >>> >>> Regarding the per-broker metrics, what are your thoughts about the >>> metric cardinality of this? There was some discussion about having a >>> startup/shutdown time per-broker and I pushed back against it because the >>> number of metrics we expose as a result is the number of brokers in the >>> cluster. Additionally, I don't think the controller can know of a live >>> broker that has not attempted to register yet in order to make a metric for >>> it and assign it a value of 0. Is a value of 0 for brokers that shutdown? >>> In that case, doesn't that make the metric cardinality worse? I think if we >>> decide to go that route we should only have states 1, 2, and 3. >>> >>> Best, >>> Kevin Wu >>> >>> On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu <kevin.wu2...@gmail.com> >>> wrote: >>> >>>> Hey all, >>>> >>>> I posted a KIP to monitor broker startup and controlled shutdown on the >>>> controller-side. Here's the link: >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup >>>> >>>> Best, >>>> Kevin Wu >>>> >>>