Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

Colin McCabe Mon, 14 Apr 2025 14:25:45 -0700

Hi Kevin,

The values for 
kafka.controller:type=KafkaController,name=BrokerRegistrationState seem a bit 
unintuitive.


Using 0 for active might be confusing to systems that treat metrics that aren't 
present as 0. Or to people just scanning the graph visually.

How about something like this?
  10 = fenced
  20 = controlled shutdown
  30 = active

That gives us some space to add additional states later if we want. It also 
avoids using 0 as a valid value...

Colin


On Mon, Apr 14, 2025, at 10:42, Kevin Wu wrote:
> Thanks for the comments Federico.
>
>> If I understand correctly unfenced == active. In the code we always
>> use the term active, so I think it would be better to use that for the >
> state 0 description.
> I've updated the KIP description to refer to "active".
>
>> You propose creating per-broker metrics indicating their state >
> (BrokerRegistrationState.kafka-X). Can't these new metrics be used to >
> derive broker counters in whatever monitor tool you decide to use? I >
> mean, we wouldn't need to store and provide > ControlledShutdownBrokerCount
> (proposed), FencedBrokerCount > (existing), ActiveBrokerCount (existing).  
> Yes,
> we can use this new metric to derive broker counters, but it's just more
> complicated for the operator to implement. Also, I don't think it's a huge
> issue that there's a slight redundancy here, since deleting the existing
> metrics will lead to compatibility issues with current monitoring setups.
>
> On Mon, Apr 14, 2025 at 12:25 PM Kevin Wu <kevin.wu2...@gmail.com> wrote:
>
>> Thanks for the comments Jose.
>> For 1 and 2, I've changed the naming of the metrics to follow your
>> suggestion of tags/attributes. For 3, I made a note as to why we need the
>> maximum. Basically, it's because the map that contains broker contact times
>> we're using as the source for these metrics removes entries when a broker
>> is fenced. Therefore, we need some default value when the entry doesn't
>> exist for the broker, but it is still registered.
>>
>> Thanks,
>> Kevin
>>
>> > Thanks for the improvement Kevin. I got a chance to look at the KIP.
>> >
>> > 1.
>>  kafka.controller:type=KafkaController,name=BrokerRegistrationState.kafka-X
>> >
>> > Can we use tags or attributes instead of different names? For example,
>> > how about
>> kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X
>> > where X is the node id?
>> >
>> > 2.
>> kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X
>> >
>> > Same here, did you consider using tags or attributes for the node id?
>> >
>> > 3. For the metrics
>> >
>> kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X,
>> > you mentioned that you will limit the value to the heartbeat timeout.
>> > Why? Wouldn't it be a useful report the entire time since the last
>> > heartbeat? That is more information instead of just reporting the
>> > value up to the heartbeat timeout.
>> >
>> > Thanks,
>> > --
>> > -José
>> >
>>
>> On Thu, Mar 6, 2025 at 1:58 PM Kevin Wu <kevin.wu2...@gmail.com> wrote:
>>
>>> That's an interesting idea. However, I think that's going to be messy and
>>>> difficult for people to use. For example, how would you set up Grafana or
>>>> Datadog to use this? The string could also get extremely long (imagine 1000
>>>> brokers all in startup.)
>>>
>>> Hmm... Yeah from what I've read so far setting this up might be kind of
>>> challenging. I'm not seeing that OTEL supports gauges for string values.
>>>
>>> I'm still a little confused as to why having a per-broker metric to
>>> expose its state is preferred, but I think this is at least part of the
>>> reason? When drafting this KIP, I was only really considering the scenarios
>>> of the broker's initial metadata load during startup and their controlled
>>> shutdown, which my proposed metrics would cover. However, there are a lot
>>> of other scenarios with fenced brokers which have already started up that
>>> the existing fencedBrokers metric doesn't really give enough information
>>> about from the controller-side, since it just reports the number. For these
>>> scenarios, I don't think my proposed startup/shutdown focused metrics would
>>> be very useful.
>>> I'm on board with the proposed per-broker metric that exposes its state.
>>> I think it would be helpful to enumerate some specific cases though for the
>>> KIP.
>>>
>>> On Thu, Feb 27, 2025 at 2:19 PM Kevin Wu <kevin.wu2...@gmail.com> wrote:
>>>
>>>> I guess my concern is that the time-based metrics would reset to 0 on
>>>>> every failover (if I understand the proposed implementation correctly).
>>>>> That seems likely to create confusion.
>>>>
>>>> Yeah that makes sense to me. I'm fine with moving towards the approach
>>>> of either (since I don't think we need both):
>>>>
>>>>    - Exposing the number of brokers in 1. startup, 2. fenced (what we
>>>>    have now), and 3. in controlled shutdown
>>>>    - Exposing a per-broker metric reflecting the state of the broker
>>>>    (more on this below).
>>>>
>>>> I think it would be useful to have a state for each broker exposed as a
>>>>> metric. I can think of a lot of scenarios where this would be useful to
>>>>> have. I don't think we should have more than one metric per broker though,
>>>>> if we can help it.
>>>>
>>>> Instead of having exactly a per-broker metric which exposes a number
>>>> that maps to a state (0, 1, 2, and 3), what if we expose 4 metrics whose
>>>> values are a comma-delimited string of the brokers in those states.
>>>> Something along the lines of:
>>>>
>>>>    - Metric: name = BrokersNotRegistered, value = "kafka-1"
>>>>    - Metric: name = BrokersRegisteredAndNeverUnfenced, value = "kafka-2"
>>>>    - Metric: name = BrokersRegisteredAndFenced, value =
>>>>    "kafka-2,kafka-3"
>>>>    - Metric: name = BrokersRegisteredRegisteredAndUnfenced, value =
>>>>    "kafka-4,kafka-5"
>>>>
>>>> I guess there will be overlap between the second and third metrics, but
>>>> there do exist metrics that expose `Gauge<String>` values.
>>>>
>>>> On Tue, Feb 25, 2025 at 4:12 PM Kevin Wu <kevin.wu2...@gmail.com> wrote:
>>>>
>>>>> Hey Colin,
>>>>>
>>>>> Thanks for the review.
>>>>>
>>>>> Regarding the metrics that reflect times: my initial thinking was to
>>>>> indeed have these be "soft state", which would be reset when a controller
>>>>> failover happens.  I'm not sure if it's a big issue if these values get
>>>>> reset though, since a controller failover means brokers in startup would
>>>>> need to register again to the new controller anyways. Since what we're
>>>>> trying to monitor with these metrics is the broker's startup and shutdown
>>>>> statuses from the controller's view, my thinking was that exposing this
>>>>> soft state would be appropriate.
>>>>>
>>>>> There exist metrics that expose other soft state of the controller in
>>>>> `QuorumControllerMetrics.java`, and I thought the proposed metrics here
>>>>> would fit with what exists there. If instead we're updating these metrics
>>>>> based on the metadata log events for registration changes, it looks like
>>>>> `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess
>>>>> we could add a `ControlledShutdownBrokerCount`. For specifically tracking
>>>>> brokers in their initial startup fencing using the log events, I'm not
>>>>> totally sure as of now how we can actually do this from only the
>>>>> information in `BrokerRegistration`. I guess we know a broker is 
>>>>> undergoing
>>>>> startup when it's fenced and has an `incarnationId` the controller hasn't
>>>>> seen before in the log?
>>>>>
>>>>> Regarding the per-broker metrics, what are your thoughts about the
>>>>> metric cardinality of this? There was some discussion about having a
>>>>> startup/shutdown time per-broker and I pushed back against it because the
>>>>> number of metrics we expose as a result is the number of brokers in the
>>>>> cluster. Additionally, I don't think the controller can know of a live
>>>>> broker that has not attempted to register yet in order to make a metric 
>>>>> for
>>>>> it and assign it a value of 0. Is a value of 0 for brokers that shutdown?
>>>>> In that case, doesn't that make the metric cardinality worse? I think if 
>>>>> we
>>>>> decide to go that route we should only have states 1, 2, and 3.
>>>>>
>>>>> Best,
>>>>> Kevin Wu
>>>>>
>>>>> On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu <kevin.wu2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> I posted a KIP to monitor broker startup and controlled shutdown on
>>>>>> the controller-side. Here's the link:
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup
>>>>>>
>>>>>> Best,
>>>>>> Kevin Wu
>>>>>>
>>>>>

Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

Reply via email to