Thanks for the KIP Colin. I liked the rationale section which clearly explains why the metrics are required and what they might indicate.
1. I have a question about the "CurrentMetadataVersion" metric. Correct me if I'm wrong, but I assume that we are referring to https://github.com/apache/kafka/blob/trunk/server-common/src/main/java/org/apache/kafka/server/common/MetadataVersion.java which is of type string or at best a floating point such as (3.3.1) !? I was wondering how we would represent it as integers that the KIP proposes? 2. This might be related to implementation but perhaps we could mention a design requirement for adding these metrics in the KIP as, "calculation of these metrics should not acquire any locks". I am asking this because in some places of the code (such as in LogCleanerManager), calculation of metrics acquires a lock which contends with the critical code path. Hence, with new metrics, I want us to be extra sure that calculation of metrics doesn't impact the core code path. 3. Again, perhaps goes in the implementation plan section of the KIP, but we should probably mention that we will add it to the documentation. I think adding the rationale like you described in this KIP will be very useful to the users as well. -- Divij Vaidya On Fri, Jun 2, 2023 at 10:05 PM Ron Dagostino <rndg...@gmail.com> wrote: > Thanks for the clarification and explanation, Colin. Looks good to me. > > Ron > > On Fri, Jun 2, 2023 at 2:52 PM Colin McCabe <cmcc...@apache.org> wrote: > > > > Hi Ron, > > > > Thanks for the review. > > > > On Fri, Jun 2, 2023, at 11:26, Ron Dagostino wrote: > > > Thanks for the KIP, Colin. The KIP cals one metric > > > "NewActiveControllersCount" but we don't append "Count" to the other > > > metric names (e.g. it is "TimedOutBrokerHeartbeats" instead of > > > "TimedOutBrokerHeartbeatsCount"). Should we be consistent (either use > > > the suffix everywhere or don't use it anywhere)? > > > > > > > That's a good point. I will add Count to the other metrics (except > CurrentMetadataVersion, which is not a count, of course) > > > > > The phrase "Note that only active controllers handle heartbeats, so > > > only they will see increases in this metric." could apply elsewhere > > > but is not mentioned elsewhere. For example, while standy controllers > > > have a queue and will publish metrics related to it, the queue will > > > remain empty (and the number of timeouts will remain constant) until a > > > controller becomes active. Or is that incorrect? > > > > Hmm. I think the other metrics all do apply to standby controllers. > Standby controllers do perform operations (mainly operations to process new > batches of records, submitted by Raft). They could have timeouts (although > it's unlikely) and could load a snapshot if they get partitioned from the > cluster and then reuinted. > > > > > > > > I wonder if we can be more generic and index these things by operation > type. > > > > > > > I think it's worth special-casing broker heartbeats here for a few > reasons. > > > > - Their timeout is relatively short, so they're likely to be one of the > first internal operations to be impacted by high load. > > > > - By design, they happen pretty often so "broker heartbeat timeouts in X > minute" will be a pretty good proxy for load in X minute. The same can't be > said for other operations, which are more intermittent. > > > > - User-initiated operations like createTopics can have user-defined > timeouts, including very short ones, so such timeouts might just mean the > user is setting very short timeouts. So you can't really alert on that > without introducing false positives. > > > > I agree that if we want more timeout stats than just "heartbeats / > everything", we should consider a more principled system. But I don't think > we need it here (yet?) and it would expand the scope a lot, so let's not > for now. > > > > best, > > Colin > > > > > > > Ron > > > > > > On Thu, Jun 1, 2023 at 6:13 PM Colin McCabe <cmcc...@apache.org> > wrote: > > >> > > >> Hi all, > > >> > > >> I posted a KIP to add some more metrics for measuring KRaft > performance. Take a look at: https://cwiki.apache.org/confluence/x/gBU0Dw > > >> > > >> best, > > >> Colin >