Re: [DISCUSS] KIP-938: Add more metrics for measuring KRaft performance

Colin McCabe Mon, 05 Jun 2023 13:26:16 -0700

On Sun, Jun 4, 2023, at 04:10, Divij Vaidya wrote:
> Thanks for the KIP Colin. I liked the rationale section which clearly
> explains why the metrics are required and what they might indicate.
>
> 1. I have a question about the "CurrentMetadataVersion" metric. Correct 
> me
> if I'm wrong, but I assume that we are referring to
> https://github.com/apache/kafka/blob/trunk/server-common/src/main/java/org/apache/kafka/server/common/MetadataVersion.java
> which is of type string or at best a floating point such as (3.3.1) !? I
> was wondering how we would represent it as integers that the KIP 
> proposes?
>


Hi Divij,

Thanks for the review.

Each metadata version has a corresponding integer in the MetadataVersion.java 
file.

> 2. This might be related to implementation but perhaps we could mention a
> design requirement for adding these metrics in the KIP as, "calculation of
> these metrics should not acquire any locks". I am asking this because in
> some places of the code (such as in LogCleanerManager), calculation of
> metrics acquires a lock which contends with the critical code path. Hence,
> with new metrics, I want us to be extra sure that calculation of metrics
> doesn't impact the core code path.
>

Good point. I added a note that we won't require any locks to read the metrics.

> 3. Again, perhaps goes in the implementation plan section of the KIP, but
> we should probably mention that we will add it to the documentation. I
> think adding the rationale like you described in this KIP will be very
> useful to the users as well.
>

I think that's kind of a given for metrics -- the docs should reflect what's in 
the latest KIP...

best,
Colin

>
> --
> Divij Vaidya
>
>
>
> On Fri, Jun 2, 2023 at 10:05 PM Ron Dagostino <rndg...@gmail.com> wrote:
>
>> Thanks for the clarification and explanation, Colin.  Looks good to me.
>>
>> Ron
>>
>> On Fri, Jun 2, 2023 at 2:52 PM Colin McCabe <cmcc...@apache.org> wrote:
>> >
>> > Hi Ron,
>> >
>> > Thanks for the review.
>> >
>> > On Fri, Jun 2, 2023, at 11:26, Ron Dagostino wrote:
>> > > Thanks for the KIP, Colin.  The KIP cals one metric
>> > > "NewActiveControllersCount" but we don't append "Count" to the other
>> > > metric names (e.g. it is "TimedOutBrokerHeartbeats" instead of
>> > > "TimedOutBrokerHeartbeatsCount").  Should we be consistent (either use
>> > > the suffix everywhere or don't use it anywhere)?
>> > >
>> >
>> > That's a good point. I will add Count to the other metrics (except
>> CurrentMetadataVersion, which is not a count, of course)
>> >
>> > > The phrase "Note that only active controllers handle heartbeats, so
>> > > only they will see increases in this metric." could apply elsewhere
>> > > but is not mentioned elsewhere.  For example, while standy controllers
>> > > have a queue and will publish metrics related to it, the queue will
>> > > remain empty (and the number of timeouts will remain constant) until a
>> > > controller becomes active.  Or is that incorrect?
>> >
>> > Hmm. I think the other metrics all do apply to standby controllers.
>> Standby controllers do perform operations (mainly operations to process new
>> batches of records, submitted by Raft). They could have timeouts (although
>> it's unlikely) and could load a snapshot if they get partitioned from the
>> cluster and then reuinted.
>> >
>> > >
>> > > I wonder if we can be more generic and index these things by operation
>> type.
>> > >
>> >
>> > I think it's worth special-casing broker heartbeats here for a few
>> reasons.
>> >
>> > - Their timeout is relatively short, so they're likely to be one of the
>> first internal operations to be impacted by high load.
>> >
>> > - By design, they happen pretty often so "broker heartbeat timeouts in X
>> minute" will be a pretty good proxy for load in X minute. The same can't be
>> said for other operations, which are more intermittent.
>> >
>> > - User-initiated operations like createTopics can have user-defined
>> timeouts, including very short ones, so such timeouts might just mean the
>> user is setting very short timeouts. So you can't really alert on that
>> without introducing false positives.
>> >
>> > I agree that if we want more timeout stats than just "heartbeats /
>> everything", we should consider a more principled system. But I don't think
>> we need it here (yet?) and it would expand the scope a lot, so let's not
>> for now.
>> >
>> > best,
>> > Colin
>> >
>> >
>> > > Ron
>> > >
>> > > On Thu, Jun 1, 2023 at 6:13 PM Colin McCabe <cmcc...@apache.org>
>> wrote:
>> > >>
>> > >> Hi all,
>> > >>
>> > >> I posted a KIP to add some more metrics for measuring KRaft
>> performance. Take a look at: https://cwiki.apache.org/confluence/x/gBU0Dw
>> > >>
>> > >> best,
>> > >> Colin
>>

Re: [DISCUSS] KIP-938: Add more metrics for measuring KRaft performance

Reply via email to