On Sun, Jun 4, 2023, at 04:10, Divij Vaidya wrote: > Thanks for the KIP Colin. I liked the rationale section which clearly > explains why the metrics are required and what they might indicate. > > 1. I have a question about the "CurrentMetadataVersion" metric. Correct > me > if I'm wrong, but I assume that we are referring to > https://github.com/apache/kafka/blob/trunk/server-common/src/main/java/org/apache/kafka/server/common/MetadataVersion.java > which is of type string or at best a floating point such as (3.3.1) !? I > was wondering how we would represent it as integers that the KIP > proposes? >
Hi Divij, Thanks for the review. Each metadata version has a corresponding integer in the MetadataVersion.java file. > 2. This might be related to implementation but perhaps we could mention a > design requirement for adding these metrics in the KIP as, "calculation of > these metrics should not acquire any locks". I am asking this because in > some places of the code (such as in LogCleanerManager), calculation of > metrics acquires a lock which contends with the critical code path. Hence, > with new metrics, I want us to be extra sure that calculation of metrics > doesn't impact the core code path. > Good point. I added a note that we won't require any locks to read the metrics. > 3. Again, perhaps goes in the implementation plan section of the KIP, but > we should probably mention that we will add it to the documentation. I > think adding the rationale like you described in this KIP will be very > useful to the users as well. > I think that's kind of a given for metrics -- the docs should reflect what's in the latest KIP... best, Colin > > -- > Divij Vaidya > > > > On Fri, Jun 2, 2023 at 10:05 PM Ron Dagostino <rndg...@gmail.com> wrote: > >> Thanks for the clarification and explanation, Colin. Looks good to me. >> >> Ron >> >> On Fri, Jun 2, 2023 at 2:52 PM Colin McCabe <cmcc...@apache.org> wrote: >> > >> > Hi Ron, >> > >> > Thanks for the review. >> > >> > On Fri, Jun 2, 2023, at 11:26, Ron Dagostino wrote: >> > > Thanks for the KIP, Colin. The KIP cals one metric >> > > "NewActiveControllersCount" but we don't append "Count" to the other >> > > metric names (e.g. it is "TimedOutBrokerHeartbeats" instead of >> > > "TimedOutBrokerHeartbeatsCount"). Should we be consistent (either use >> > > the suffix everywhere or don't use it anywhere)? >> > > >> > >> > That's a good point. I will add Count to the other metrics (except >> CurrentMetadataVersion, which is not a count, of course) >> > >> > > The phrase "Note that only active controllers handle heartbeats, so >> > > only they will see increases in this metric." could apply elsewhere >> > > but is not mentioned elsewhere. For example, while standy controllers >> > > have a queue and will publish metrics related to it, the queue will >> > > remain empty (and the number of timeouts will remain constant) until a >> > > controller becomes active. Or is that incorrect? >> > >> > Hmm. I think the other metrics all do apply to standby controllers. >> Standby controllers do perform operations (mainly operations to process new >> batches of records, submitted by Raft). They could have timeouts (although >> it's unlikely) and could load a snapshot if they get partitioned from the >> cluster and then reuinted. >> > >> > > >> > > I wonder if we can be more generic and index these things by operation >> type. >> > > >> > >> > I think it's worth special-casing broker heartbeats here for a few >> reasons. >> > >> > - Their timeout is relatively short, so they're likely to be one of the >> first internal operations to be impacted by high load. >> > >> > - By design, they happen pretty often so "broker heartbeat timeouts in X >> minute" will be a pretty good proxy for load in X minute. The same can't be >> said for other operations, which are more intermittent. >> > >> > - User-initiated operations like createTopics can have user-defined >> timeouts, including very short ones, so such timeouts might just mean the >> user is setting very short timeouts. So you can't really alert on that >> without introducing false positives. >> > >> > I agree that if we want more timeout stats than just "heartbeats / >> everything", we should consider a more principled system. But I don't think >> we need it here (yet?) and it would expand the scope a lot, so let's not >> for now. >> > >> > best, >> > Colin >> > >> > >> > > Ron >> > > >> > > On Thu, Jun 1, 2023 at 6:13 PM Colin McCabe <cmcc...@apache.org> >> wrote: >> > >> >> > >> Hi all, >> > >> >> > >> I posted a KIP to add some more metrics for measuring KRaft >> performance. Take a look at: https://cwiki.apache.org/confluence/x/gBU0Dw >> > >> >> > >> best, >> > >> Colin >>