Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

Kevin Wu Mon, 14 Apr 2025 10:26:34 -0700

Thanks for the comments Jose.
For 1 and 2, I've changed the naming of the metrics to follow your
suggestion of tags/attributes. For 3, I made a note as to why we need the
maximum. Basically, it's because the map that contains broker contact times
we're using as the source for these metrics removes entries when a broker
is fenced. Therefore, we need some default value when the entry doesn't
exist for the broker, but it is still registered.


Thanks,
Kevin

> Thanks for the improvement Kevin. I got a chance to look at the KIP.
>
> 1.
 kafka.controller:type=KafkaController,name=BrokerRegistrationState.kafka-X
>
> Can we use tags or attributes instead of different names? For example,
> how about
kafka.controller:type=KafkaController,name=BrokerRegistrationState,broker=X
> where X is the node id?
>
> 2.
kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X
>
> Same here, did you consider using tags or attributes for the node id?
>
> 3. For the metrics
>
kafka.controller:type=KafkaController,name=TimeSinceLastHeartbeatReceivedMs.kafka-X,
> you mentioned that you will limit the value to the heartbeat timeout.
> Why? Wouldn't it be a useful report the entire time since the last
> heartbeat? That is more information instead of just reporting the
> value up to the heartbeat timeout.
>
> Thanks,
> --
> -José
>

On Thu, Mar 6, 2025 at 1:58 PM Kevin Wu <kevin.wu2...@gmail.com> wrote:

> That's an interesting idea. However, I think that's going to be messy and
>> difficult for people to use. For example, how would you set up Grafana or
>> Datadog to use this? The string could also get extremely long (imagine 1000
>> brokers all in startup.)
>
> Hmm... Yeah from what I've read so far setting this up might be kind of
> challenging. I'm not seeing that OTEL supports gauges for string values.
>
> I'm still a little confused as to why having a per-broker metric to expose
> its state is preferred, but I think this is at least part of the reason?
> When drafting this KIP, I was only really considering the scenarios of the
> broker's initial metadata load during startup and their controlled
> shutdown, which my proposed metrics would cover. However, there are a lot
> of other scenarios with fenced brokers which have already started up that
> the existing fencedBrokers metric doesn't really give enough information
> about from the controller-side, since it just reports the number. For these
> scenarios, I don't think my proposed startup/shutdown focused metrics would
> be very useful.
> I'm on board with the proposed per-broker metric that exposes its state. I
> think it would be helpful to enumerate some specific cases though for the
> KIP.
>
> On Thu, Feb 27, 2025 at 2:19 PM Kevin Wu <kevin.wu2...@gmail.com> wrote:
>
>> I guess my concern is that the time-based metrics would reset to 0 on
>>> every failover (if I understand the proposed implementation correctly).
>>> That seems likely to create confusion.
>>
>> Yeah that makes sense to me. I'm fine with moving towards the approach of
>> either (since I don't think we need both):
>>
>>    - Exposing the number of brokers in 1. startup, 2. fenced (what we
>>    have now), and 3. in controlled shutdown
>>    - Exposing a per-broker metric reflecting the state of the broker
>>    (more on this below).
>>
>> I think it would be useful to have a state for each broker exposed as a
>>> metric. I can think of a lot of scenarios where this would be useful to
>>> have. I don't think we should have more than one metric per broker though,
>>> if we can help it.
>>
>> Instead of having exactly a per-broker metric which exposes a number that
>> maps to a state (0, 1, 2, and 3), what if we expose 4 metrics whose values
>> are a comma-delimited string of the brokers in those states.
>> Something along the lines of:
>>
>>    - Metric: name = BrokersNotRegistered, value = "kafka-1"
>>    - Metric: name = BrokersRegisteredAndNeverUnfenced, value = "kafka-2"
>>    - Metric: name = BrokersRegisteredAndFenced, value = "kafka-2,kafka-3"
>>    - Metric: name = BrokersRegisteredRegisteredAndUnfenced, value =
>>    "kafka-4,kafka-5"
>>
>> I guess there will be overlap between the second and third metrics, but
>> there do exist metrics that expose `Gauge<String>` values.
>>
>> On Tue, Feb 25, 2025 at 4:12 PM Kevin Wu <kevin.wu2...@gmail.com> wrote:
>>
>>> Hey Colin,
>>>
>>> Thanks for the review.
>>>
>>> Regarding the metrics that reflect times: my initial thinking was to
>>> indeed have these be "soft state", which would be reset when a controller
>>> failover happens.  I'm not sure if it's a big issue if these values get
>>> reset though, since a controller failover means brokers in startup would
>>> need to register again to the new controller anyways. Since what we're
>>> trying to monitor with these metrics is the broker's startup and shutdown
>>> statuses from the controller's view, my thinking was that exposing this
>>> soft state would be appropriate.
>>>
>>> There exist metrics that expose other soft state of the controller in
>>> `QuorumControllerMetrics.java`, and I thought the proposed metrics here
>>> would fit with what exists there. If instead we're updating these metrics
>>> based on the metadata log events for registration changes, it looks like
>>> `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess
>>> we could add a `ControlledShutdownBrokerCount`. For specifically tracking
>>> brokers in their initial startup fencing using the log events, I'm not
>>> totally sure as of now how we can actually do this from only the
>>> information in `BrokerRegistration`. I guess we know a broker is undergoing
>>> startup when it's fenced and has an `incarnationId` the controller hasn't
>>> seen before in the log?
>>>
>>> Regarding the per-broker metrics, what are your thoughts about the
>>> metric cardinality of this? There was some discussion about having a
>>> startup/shutdown time per-broker and I pushed back against it because the
>>> number of metrics we expose as a result is the number of brokers in the
>>> cluster. Additionally, I don't think the controller can know of a live
>>> broker that has not attempted to register yet in order to make a metric for
>>> it and assign it a value of 0. Is a value of 0 for brokers that shutdown?
>>> In that case, doesn't that make the metric cardinality worse? I think if we
>>> decide to go that route we should only have states 1, 2, and 3.
>>>
>>> Best,
>>> Kevin Wu
>>>
>>> On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu <kevin.wu2...@gmail.com>
>>> wrote:
>>>
>>>> Hey all,
>>>>
>>>> I posted a KIP to monitor broker startup and controlled shutdown on the
>>>> controller-side. Here's the link:
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup
>>>>
>>>> Best,
>>>> Kevin Wu
>>>>
>>>

Re: [DISCUSS] KIP-1131: Controller-side monitoring for broker shutdown and startup

Reply via email to