Hi Kevin,

Thanks for the KIP.

I notice that you have some metrics that reflect times here, such as 
LongestPendingStartupTimeMs, LongestPendingControlledShutdownTimeMs, etc. I 
think this may be difficult to do with complete accuracy because we don't 
include times in the metadata log events for registration changes. If we just 
do the obvious thing and make the times "soft state" then these times will be 
reset when there is a controller failover.
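
To make the failover problem concrete, here is a rough sketch of the "soft state" approach. The class and method names are made up for illustration and are not existing controller code. The timestamps live only in memory, so a newly active controller can only start counting from when it first sees the broker while replaying the log.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not actual Kafka controller code.
final class PendingStartupTracker {
    // Soft state: these timestamps are NOT in the metadata log,
    // so a new controller starts with an empty map after failover.
    private final Map<Integer, Long> startupSeenAtMs = new HashMap<>();

    // Called when this controller first learns that a broker is in startup.
    void onBrokerInStartup(int brokerId, long nowMs) {
        startupSeenAtMs.putIfAbsent(brokerId, nowMs);
    }

    // Called when the broker is unfenced, i.e. startup has finished.
    void onBrokerUnfenced(int brokerId) {
        startupSeenAtMs.remove(brokerId);
    }

    // Longest time any broker has been pending, as seen by THIS controller.
    long longestPendingStartupTimeMs(long nowMs) {
        long longest = 0;
        for (long seenAtMs : startupSeenAtMs.values()) {
            longest = Math.max(longest, nowMs - seenAtMs);
        }
        return longest;
    }
}
```

If a broker starts at t=0 and the controller fails over at t=60s, the new controller's tracker first sees the broker at t=60s, so at t=90s it reports 30s rather than the true 90s.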

Perhaps it would be simpler to cut out the metrics that include a time and just 
have NumberOfBrokersInStartup and NumberOfBrokersInControlledShutdown? Then 
people could set up an alert on these metrics. For example, set up an alert 
that fires if NumberOfBrokersInStartup is non-zero for more than 5 minutes.
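
As a sketch of what such an alert rule does on the monitoring side (in practice this logic would live in whatever alerting system people already use; the class below is hypothetical and only illustrates the "non-zero for more than N minutes" condition):

```java
import java.time.Duration;

// Hypothetical alert rule, evaluated by the monitoring system rather than by Kafka.
final class NonZeroForAlert {
    private final long thresholdMs;
    // Time of the first sample in the current non-zero run; -1 if the last sample was zero.
    private long nonZeroSinceMs = -1;

    NonZeroForAlert(Duration threshold) {
        this.thresholdMs = threshold.toMillis();
    }

    // Feed one sample of the gauge (e.g. NumberOfBrokersInStartup).
    // Returns true if the gauge has been non-zero for longer than the threshold.
    boolean observe(long nowMs, int value) {
        if (value == 0) {
            nonZeroSinceMs = -1;
            return false;
        }
        if (nonZeroSinceMs < 0) {
            nonZeroSinceMs = nowMs;
        }
        return nowMs - nonZeroSinceMs > thresholdMs;
    }
}
```

The timing state here belongs to the alerting system, so it is not affected by a controller failover as long as the new controller keeps reporting the same count.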

I wonder if it would be a good idea to have a per-broker metric on the 
controller that shows the state of each broker. For example: 0 = not registered, 1 = 
registered and never unfenced, 2 = registered and fenced, 3 = registered and 
unfenced. It obviously would add some more metrics for us to track, but I think 
it would be more useful than a bunch of special-purpose metrics...
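
Roughly, the encoding could look like the sketch below. The enum name and the three boolean inputs are assumptions for illustration; the real controller would derive the state from whatever it tracks per broker registration.

```java
// Hypothetical encoding of the per-broker state gauge described above.
enum BrokerRegistrationState {
    NOT_REGISTERED(0),
    REGISTERED_NEVER_UNFENCED(1),
    REGISTERED_FENCED(2),
    REGISTERED_UNFENCED(3);

    private final int code;

    BrokerRegistrationState(int code) {
        this.code = code;
    }

    // Numeric value that the per-broker gauge would report.
    int code() {
        return code;
    }

    // Derive the state from three assumed facts about a broker.
    static BrokerRegistrationState of(boolean registered, boolean fenced, boolean everUnfenced) {
        if (!registered) {
            return NOT_REGISTERED;
        }
        if (!fenced) {
            return REGISTERED_UNFENCED;
        }
        return everUnfenced ? REGISTERED_FENCED : REGISTERED_NEVER_UNFENCED;
    }
}
```

With this, a broker stuck in startup shows up as a gauge sitting at 1, and one that was running and then got fenced shows up as 2.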

best,
Colin


On Mon, Jan 27, 2025, at 10:56, Kevin Wu wrote:
> Hey all,
>
> I posted a KIP to monitor broker startup and controlled shutdown on the
> controller-side. Here's the link:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1131%3A+Controller-side+monitoring+for+broker+shutdown+and+startup
>
> Best,
> Kevin Wu
