Re: [DISCUSS] KIP-938: Add more metrics for measuring KRaft performance

Ron Dagostino Fri, 02 Jun 2023 13:05:38 -0700

Thanks for the clarification and explanation, Colin.  Looks good to me.

Ron


On Fri, Jun 2, 2023 at 2:52 PM Colin McCabe <cmcc...@apache.org> wrote:
>
> Hi Ron,
>
> Thanks for the review.
>
> On Fri, Jun 2, 2023, at 11:26, Ron Dagostino wrote:
> > Thanks for the KIP, Colin.  The KIP cals one metric
> > "NewActiveControllersCount" but we don't append "Count" to the other
> > metric names (e.g. it is "TimedOutBrokerHeartbeats" instead of
> > "TimedOutBrokerHeartbeatsCount").  Should we be consistent (either use
> > the suffix everywhere or don't use it anywhere)?
> >
>
> That's a good point. I will add Count to the other metrics (except 
> CurrentMetadataVersion, which is not a count, of course)
>
> > The phrase "Note that only active controllers handle heartbeats, so
> > only they will see increases in this metric." could apply elsewhere
> > but is not mentioned elsewhere.  For example, while standy controllers
> > have a queue and will publish metrics related to it, the queue will
> > remain empty (and the number of timeouts will remain constant) until a
> > controller becomes active.  Or is that incorrect?
>
> Hmm. I think the other metrics all do apply to standby controllers. Standby 
> controllers do perform operations (mainly operations to process new batches 
> of records, submitted by Raft). They could have timeouts (although it's 
> unlikely) and could load a snapshot if they get partitioned from the cluster 
> and then reuinted.
>
> >
> > I wonder if we can be more generic and index these things by operation type.
> >
>
> I think it's worth special-casing broker heartbeats here for a few reasons.
>
> - Their timeout is relatively short, so they're likely to be one of the first 
> internal operations to be impacted by high load.
>
> - By design, they happen pretty often so "broker heartbeat timeouts in X 
> minute" will be a pretty good proxy for load in X minute. The same can't be 
> said for other operations, which are more intermittent.
>
> - User-initiated operations like createTopics can have user-defined timeouts, 
> including very short ones, so such timeouts might just mean the user is 
> setting very short timeouts. So you can't really alert on that without 
> introducing false positives.
>
> I agree that if we want more timeout stats than just "heartbeats / 
> everything", we should consider a more principled system. But I don't think 
> we need it here (yet?) and it would expand the scope a lot, so let's not for 
> now.
>
> best,
> Colin
>
>
> > Ron
> >
> > On Thu, Jun 1, 2023 at 6:13 PM Colin McCabe <cmcc...@apache.org> wrote:
> >>
> >> Hi all,
> >>
> >> I posted a KIP to add some more metrics for measuring KRaft performance. 
> >> Take a look at: https://cwiki.apache.org/confluence/x/gBU0Dw
> >>
> >> best,
> >> Colin

Re: [DISCUSS] KIP-938: Add more metrics for measuring KRaft performance

Reply via email to