On Tue, Feb 25, 2025, at 14:40, Colin McCabe wrote:
> On Tue, Feb 25, 2025, at 14:12, Kevin Wu wrote:
>> Hey Colin,
>>
>> Thanks for the review.
>>
>> Regarding the metrics that reflect times: my initial thinking was to indeed
>> have these be "soft state", which would be reset when a controller failover
>> happens.  I'm not sure if it's a big issue if these values get reset
>> though, since a controller failover means brokers in startup would need to
>> register again to the new controller anyways. Since what we're trying to
>> monitor with these metrics is the broker's startup and shutdown statuses
>> from the controller's view, my thinking was that exposing this soft state
>> would be appropriate.
>
> I guess my concern is that the time-based metrics would reset to 0 on 
> every failover (if I understand the proposed implementation correctly). 
> That seems likely to create confusion.
>
>>
>> There exist metrics that expose other soft state of the controller in
>> `QuorumControllerMetrics.java`, and I thought the proposed metrics here
>> would fit with what exists there. If instead we're updating these metrics
>> based on the metadata log events for registration changes, it looks like
>> `ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess
>> we could add a `ControlledShutdownBrokerCount`.
>
> Yes, having a ControlledShutdownBrokerCount seems completely 
> appropriate. This is now hard state now as well, that can be read 
> directly out of the metadata log.
>
> (Or at least it is for MV > 3.3-IV3, which is comparatively ancient 
> now! We could just not support this metric when in very very old 
> metadata versions.)

I meant for MV >= 3.3-IV3 :)

>
>> For specifically tracking
>> brokers in their initial startup fencing using the log events, I'm not
>> totally sure as of now how we can actually do this from only the
>> information in `BrokerRegistration`. I guess we know a broker is undergoing
>> startup when it's fenced and has an `incarnationId` the controller hasn't
>> seen before in the log?
>
> I think you would have to add another boolean to the 
> RegisterBrokerRecord in order to be able to detect brokers that were 
> starting up. It might be worth it to do this, though.
>
>>
>> Regarding the per-broker metrics, what are your thoughts about the metric
>> cardinality of this? There was some discussion about having a
>> startup/shutdown time per-broker and I pushed back against it because the
>> number of metrics we expose as a result is the number of brokers in the
>> cluster.
>
> I think it would be useful to have a state for each broker exposed as a 
> metric. I can think of a lot of scenarios where this would be useful to 
> have. I don't think we should have more than one metric per broker 
> though, if we can help it.
>
>> Additionally, I don't think the controller can know of a live
>> broker that has not attempted to register yet in order to make a metric for
>> it and assign it a value of 0. Is a value of 0 for brokers that shutdown?
>> In that case, doesn't that make the metric cardinality worse? I think if we
>> decide to go that route we should only have states 1, 2, and 3.
>
> Unless the broker has been unregistered, we still know about it even if 
> it's fenced and not heartbeating.
>
> We don't know about it if we've never seen it before (like the first 
> time every node in the cluster starts up, ever), but that's a pretty 
> rare scenario. 
>
> best,
> Colin
>
>
>>
>> Best,
>> Kevin Wu
>>
>> On Mon, Jan 27, 2025 at 12:56 PM Kevin Wu <kevin.wu2...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> I posted a KIP to monitor broker startup and controlled shutdown on the
>>> controller-side. Here's the link:
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup
>>>
>>> Best,
>>> Kevin Wu
>>>

Reply via email to