Hey Colin, > How about something like this? > 10 = fenced > 20 = controlled shutdown > 30 = active
Yeah, that seems reasonable to me. Thanks for the suggestion. Kevin On Mon, Apr 14, 2025 at 12:42 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: > Thanks for the comments Federico. > > > If I understand correctly unfenced == active. In the code we always > > use the term active, so I think it would be better to use that for the > > state 0 description. > I've updated the KIP description to refer to "active". > > > You propose creating per-broker metrics indicating their state > > (BrokerRegistrationState.kafka-X). Can't these new metrics be used to > > derive broker counters in whatever monitor tool you decide to use? I > > mean, we wouldn't need to store and provide > ControlledShutdownBrokerCount > (proposed), FencedBrokerCount > (existing), ActiveBrokerCount (existing). > Yes, we can use this new metric to derive broker counters, but it's just > more complicated for the operator to implement. Also, I don't think it's a > huge issue that there's a slight redundancy here, since deleting the > existing metrics will lead to compatibility issues with current monitoring > setups. > > On Mon, Apr 14, 2025 at 12:25 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: > >> Thanks for the comments Jose. >> For 1 and 2, I've changed the naming of the metrics to follow your >> suggestion of tags/attributes. For 3, I made a note as to why we need the >> maximum. Basically, it's because the map that contains broker contact times >> we're using as the source for these metrics removes entries when a broker >> is fenced. Therefore, we need some default value when the entry doesn't >> exist for the broker, but it is still registered. >> >> Thanks, >> Kevin >> >> > Thanks for the improvement Kevin. I got a chance to look at the KIP. >> > >> > 1. >> kafka.controller:type=KafkaController,name=BrokerRegistrationState.kafka-X >> > >> > Can we use tags or attributes instead of different names? For example, >> > how about >> kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X >> > where X is the node id? >> > >> > 2. >> kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X >> > >> > Same here, did you consider using tags or attributes for the node id? >> > >> > 3. For the metrics >> > >> kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X, >> > you mentioned that you will limit the value to the heartbeat timeout. >> > Why? Wouldn't it be a useful report the entire time since the last >> > heartbeat? That is more information instead of just reporting the >> > value up to the heartbeat timeout. >> > >> > Thanks, >> > -- >> > -José >> > >> >> On Thu, Mar 6, 2025 at 1:58 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: >> >>> That's an interesting idea. However, I think that's going to be messy >>>> and difficult for people to use. For example, how would you set up Grafana >>>> or Datadog to use this? The string could also get extremely long (imagine >>>> 1000 brokers all in startup.) >>> >>> Hmm... Yeah from what I've read so far setting this up might be kind of >>> challenging. I'm not seeing that OTEL supports gauges for string values. >>> >>> I'm still a little confused as to why having a per-broker metric to >>> expose its state is preferred, but I think this is at least part of the >>> reason? When drafting this KIP, I was only really considering the scenarios >>> of the broker's initial metadata load during startup and their controlled >>> shutdown, which my proposed metrics would cover. However, there are a lot >>> of other scenarios with fenced brokers which have already started up that >>> the existing fencedBrokers metric doesn't really give enough information >>> about from the controller-side, since it just reports the number. For these >>> scenarios, I don't think my proposed startup/shutdown focused metrics would >>> be very useful. >>> I'm on board with the proposed per-broker metric that exposes its state. >>> I think it would be helpful to enumerate some specific cases though for the >>> KIP. >>> >>> On Thu, Feb 27, 2025 at 2:19 PM Kevin Wu <kevin.wu2...@gmail.com> wrote: >>> >>>> I guess my concern is that the time-based metrics would reset to 0 on >>>>> every failover (if I understand the proposed implementation correctly). >>>>> That seems likely to create confusion. >>>> >>>> Yeah that makes sense to me. I'm fine with moving towards the approach >>>> of either (since I don't think we need both): >>>> >>>> - Exposing the number of brokers in 1. startup, 2. fenced (what we >>>> have now), and 3. in controlled shutdown >>>> - Exposing a per-broker metric reflecting the state of the broker >>>> (more on this below). >>>> >>>> I think it would be useful to have a state for each broker exposed as a >>>>> metric. I can think of a lot of scenarios where this would be useful to >>>>> have. I don't think we should have more than one metric per broker though, >>>>> if we can help it. >>>> >>>> Instead of having exactly a per-broker metric which exposes a number >>>> that maps to a state (0, 1, 2, and 3), what if we expose 4 metrics whose >>>> values are a comma-delimited string of the brokers in those states. >>>> Something along the lines of: >>>> >>>> - Metric: name = BrokersNotRegistered, value = "kafka-1" >>>> - Metric: name = BrokersRegisteredAndNeverUnfenced, value = >>>> "kafka-2" >>>> - Metric: name = BrokersRegisteredAndFenced, value = >>>> "kafka-2,kafka-3" >>>> - Metric: name = BrokersRegisteredRegisteredAndUnfenced, value = >>>> "kafka-4,kafka-5" >>>> >>>> I guess there will be overlap between the second and third metrics, but >>>> there do exist metrics that expose `Gauge<String>` values. >>>> >>>> On Tue, Feb 25, 2025 at 4:12 PM Kevin Wu <kevin.wu2...@gmail.com> >>>> wrote: >>>> >>>>> Hey Colin, >>>>> >>>>> Thanks for the review. >>>>> >>>>> Regarding the metrics that reflect times: my initial thinking was to >>>>> indeed have these be "soft state", which would be reset when a controller >>>>> failover happens. I'm not sure if it's a big issue if these values get >>>>> reset though, since a controller failover means brokers in startup would >>>>> need to register again to the new controller anyways. Since what we're >>>>> trying to monitor with these metrics is the broker's startup and shutdown >>>>> statuses from the controller's view, my thinking was that exposing this >>>>> soft state would be appropriate. >>>>> >>>>> There exist metrics that expose other soft state of the controller in >>>>> `QuorumControllerMetrics.java`, and I thought the proposed metrics here >>>>> would fit with what exists there. If instead we're updating these metrics >>>>> based on the metadata log events for registration changes, it looks like >>>>> `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess >>>>> we could add a `ControlledShutdownBrokerCount`. For specifically tracking >>>>> brokers in their initial startup fencing using the log events, I'm not >>>>> totally sure as of now how we can actually do this from only the >>>>> information in `BrokerRegistration`. I guess we know a broker is >>>>> undergoing >>>>> startup when it's fenced and has an `incarnationId` the controller hasn't >>>>> seen before in the log? >>>>> >>>>> Regarding the per-broker metrics, what are your thoughts about the >>>>> metric cardinality of this? There was some discussion about having a >>>>> startup/shutdown time per-broker and I pushed back against it because the >>>>> number of metrics we expose as a result is the number of brokers in the >>>>> cluster. Additionally, I don't think the controller can know of a live >>>>> broker that has not attempted to register yet in order to make a metric >>>>> for >>>>> it and assign it a value of 0. Is a value of 0 for brokers that shutdown? >>>>> In that case, doesn't that make the metric cardinality worse? I think if >>>>> we >>>>> decide to go that route we should only have states 1, 2, and 3. >>>>> >>>>> Best, >>>>> Kevin Wu >>>>> >>>>> On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu <kevin.wu2...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hey all, >>>>>> >>>>>> I posted a KIP to monitor broker startup and controlled shutdown on >>>>>> the controller-side. Here's the link: >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup >>>>>> >>>>>> Best, >>>>>> Kevin Wu >>>>>> >>>>>