Re: [DISCUSS] KIP-938: Add more metrics for measuring KRaft performance

Colin McCabe Fri, 02 Jun 2023 11:52:32 -0700

Hi Ron,

Thanks for the review.

On Fri, Jun 2, 2023, at 11:26, Ron Dagostino wrote:
> Thanks for the KIP, Colin.  The KIP cals one metric
> "NewActiveControllersCount" but we don't append "Count" to the other
> metric names (e.g. it is "TimedOutBrokerHeartbeats" instead of
> "TimedOutBrokerHeartbeatsCount").  Should we be consistent (either use
> the suffix everywhere or don't use it anywhere)?
>

That's a good point. I will add Count to the other metrics (except 
CurrentMetadataVersion, which is not a count, of course)

> The phrase "Note that only active controllers handle heartbeats, so
> only they will see increases in this metric." could apply elsewhere
> but is not mentioned elsewhere.  For example, while standy controllers
> have a queue and will publish metrics related to it, the queue will
> remain empty (and the number of timeouts will remain constant) until a
> controller becomes active.  Or is that incorrect?

Hmm. I think the other metrics all do apply to standby controllers. Standby 
controllers do perform operations (mainly operations to process new batches of 
records, submitted by Raft). They could have timeouts (although it's unlikely) 
and could load a snapshot if they get partitioned from the cluster and then 
reuinted.

>
> I wonder if we can be more generic and index these things by operation type.
>

I think it's worth special-casing broker heartbeats here for a few reasons.

- Their timeout is relatively short, so they're likely to be one of the first 
internal operations to be impacted by high load.

- By design, they happen pretty often so "broker heartbeat timeouts in X 
minute" will be a pretty good proxy for load in X minute. The same can't be 
said for other operations, which are more intermittent.

- User-initiated operations like createTopics can have user-defined timeouts, 
including very short ones, so such timeouts might just mean the user is setting 
very short timeouts. So you can't really alert on that without introducing 
false positives.

I agree that if we want more timeout stats than just "heartbeats / everything", 
we should consider a more principled system. But I don't think we need it here 
(yet?) and it would expand the scope a lot, so let's not for now.

best,
Colin

> Ron
>
> On Thu, Jun 1, 2023 at 6:13 PM Colin McCabe <cmcc...@apache.org> wrote:
>>
>> Hi all,
>>
>> I posted a KIP to add some more metrics for measuring KRaft performance. 
>> Take a look at: https://cwiki.apache.org/confluence/x/gBU0Dw
>>
>> best,
>> Colin

Re: [DISCUSS] KIP-938: Add more metrics for measuring KRaft performance

Reply via email to