Hey Colin,

Thanks for the review.

Regarding the metrics that reflect times: my initial thinking was to indeed
have these be "soft state", which would be reset when a controller failover
happens.  I'm not sure if it's a big issue if these values get reset
though, since a controller failover means brokers in startup would need to
register again to the new controller anyways. Since what we're trying to
monitor with these metrics is the broker's startup and shutdown statuses
from the controller's view, my thinking was that exposing this soft state
would be appropriate.

There exist metrics that expose other soft state of the controller in
`QuorumControllerMetrics.java`, and I thought the proposed metrics here
would fit with what exists there. If instead we're updating these metrics
based on the metadata log events for registration changes, it looks like
`ControllerMetadataMetrics` has a `FencedBrokerCount` metric, and I guess
we could add a `ControlledShutdownBrokerCount`. For specifically tracking
brokers in their initial startup fencing using the log events, I'm not
totally sure as of now how we can actually do this from only the
information in `BrokerRegistration`. I guess we know a broker is undergoing
startup when it's fenced and has an `incarnationId` the controller hasn't
seen before in the log?

Regarding the per-broker metrics, what are your thoughts about the metric
cardinality of this? There was some discussion about having a
startup/shutdown time per-broker and I pushed back against it because the
number of metrics we expose as a result is the number of brokers in the
cluster. Additionally, I don't think the controller can know of a live
broker that has not attempted to register yet in order to make a metric for
it and assign it a value of 0. Is a value of 0 for brokers that shutdown?
In that case, doesn't that make the metric cardinality worse? I think if we
decide to go that route we should only have states 1, 2, and 3.

Best,
Kevin Wu


On 2025/02/20 00:22:00 Colin McCabe wrote:
> Hi Kevin,
>
> Thanks for the KIP.
>
> I notice that you have some metrics that reflect times here, such as
LongestPendingStartupTimeMs, LongestPendingControlledShudownTimeMs, etc. I
think this may be difficult to do with complete accuracy because we don't
include times in the metadata log events for registration changes. If we
just do the obvious thing and make the times "soft state" then these times
will be reset when there is a controller failover.
>
> Perhaps it would be simpler to cut out the metrics that include a time
and just have NumberOfBrokersInStartup and
NumberOfBrokersInControlledShutdown ? Then people could set up an alert on
these metrics. For example, set up an alert that fires if
NumberOfBrokersInStartup is non-zero for more than 5 minutes.
>
> I wonder if it would be a good idea to have a per-broker metric on the
controller that showed the state of each broker. Like 0 = not registered, 1
= registered and never unfenced, 2 = registered and fenced, 3 = registered
and unfenced. It obviously would add some more metrics for us to track, but
I think it would be more useful than a bunch of special-purpose metrics...
>
> best,
> Colin
>
>
> On Mon, Jan 27, 2025, at 10:56, Kevin Wu wrote:
> > Hey all,
> >
> > I posted a KIP to monitor broker startup and controlled shutdown on the
> > controller-side. Here's the link:
> >
https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup
> >
> > Best,
> > Kevin Wu
>

Reply via email to