>
> I guess my concern is that the time-based metrics would reset to 0 on
> every failover (if I understand the proposed implementation correctly).
> That seems likely to create confusion.

Yeah that makes sense to me. I'm fine with moving towards the approach of
either (since I don't think we need both):

   - Exposing the number of brokers in 1. startup, 2. fenced (what we have
   now), and 3. in controlled shutdown
   - Exposing a per-broker metric reflecting the state of the broker (more
   on this below).

I think it would be useful to have a state for each broker exposed as a
> metric. I can think of a lot of scenarios where this would be useful to
> have. I don't think we should have more than one metric per broker though,
> if we can help it.

Instead of having exactly a per-broker metric which exposes a number that
maps to a state (0, 1, 2, and 3), what if we expose 4 metrics whose values
are a comma-delimited string of the brokers in those states.
Something along the lines of:

   - Metric: name = BrokersNotRegistered, value = "kafka-1"
   - Metric: name = BrokersRegisteredAndNeverUnfenced, value = "kafka-2"
   - Metric: name = BrokersRegisteredAndFenced, value = "kafka-2,kafka-3"
   - Metric: name = BrokersRegisteredRegisteredAndUnfenced, value =
   "kafka-4,kafka-5"

I guess there will be overlap between the second and third metrics, but
there do exist metrics that expose `Gauge<String>` values.

On Tue, Feb 25, 2025 at 4:12 PM Kevin Wu <kevin.wu2...@gmail.com> wrote:

> Hey Colin,
>
> Thanks for the review.
>
> Regarding the metrics that reflect times: my initial thinking was to
> indeed have these be "soft state", which would be reset when a controller
> failover happens.  I'm not sure if it's a big issue if these values get
> reset though, since a controller failover means brokers in startup would
> need to register again to the new controller anyways. Since what we're
> trying to monitor with these metrics is the broker's startup and shutdown
> statuses from the controller's view, my thinking was that exposing this
> soft state would be appropriate.
>
> There exist metrics that expose other soft state of the controller in
> `QuorumControllerMetrics.java`, and I thought the proposed metrics here
> would fit with what exists there. If instead we're updating these metrics
> based on the metadata log events for registration changes, it looks like
> `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess
> we could add a `ControlledShutdownBrokerCount`. For specifically tracking
> brokers in their initial startup fencing using the log events, I'm not
> totally sure as of now how we can actually do this from only the
> information in `BrokerRegistration`. I guess we know a broker is undergoing
> startup when it's fenced and has an `incarnationId` the controller hasn't
> seen before in the log?
>
> Regarding the per-broker metrics, what are your thoughts about the metric
> cardinality of this? There was some discussion about having a
> startup/shutdown time per-broker and I pushed back against it because the
> number of metrics we expose as a result is the number of brokers in the
> cluster. Additionally, I don't think the controller can know of a live
> broker that has not attempted to register yet in order to make a metric for
> it and assign it a value of 0. Is a value of 0 for brokers that shutdown?
> In that case, doesn't that make the metric cardinality worse? I think if we
> decide to go that route we should only have states 1, 2, and 3.
>
> Best,
> Kevin Wu
>
> On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu <kevin.wu2...@gmail.com> wrote:
>
>> Hey all,
>>
>> I posted a KIP to monitor broker startup and controlled shutdown on the
>> controller-side. Here's the link:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup
>>
>> Best,
>> Kevin Wu
>>
>

Reply via email to