Thanks for the clarification and explanation, Colin. Looks good to me. Ron
On Fri, Jun 2, 2023 at 2:52 PM Colin McCabe <cmcc...@apache.org> wrote: > > Hi Ron, > > Thanks for the review. > > On Fri, Jun 2, 2023, at 11:26, Ron Dagostino wrote: > > Thanks for the KIP, Colin. The KIP cals one metric > > "NewActiveControllersCount" but we don't append "Count" to the other > > metric names (e.g. it is "TimedOutBrokerHeartbeats" instead of > > "TimedOutBrokerHeartbeatsCount"). Should we be consistent (either use > > the suffix everywhere or don't use it anywhere)? > > > > That's a good point. I will add Count to the other metrics (except > CurrentMetadataVersion, which is not a count, of course) > > > The phrase "Note that only active controllers handle heartbeats, so > > only they will see increases in this metric." could apply elsewhere > > but is not mentioned elsewhere. For example, while standy controllers > > have a queue and will publish metrics related to it, the queue will > > remain empty (and the number of timeouts will remain constant) until a > > controller becomes active. Or is that incorrect? > > Hmm. I think the other metrics all do apply to standby controllers. Standby > controllers do perform operations (mainly operations to process new batches > of records, submitted by Raft). They could have timeouts (although it's > unlikely) and could load a snapshot if they get partitioned from the cluster > and then reuinted. > > > > > I wonder if we can be more generic and index these things by operation type. > > > > I think it's worth special-casing broker heartbeats here for a few reasons. > > - Their timeout is relatively short, so they're likely to be one of the first > internal operations to be impacted by high load. > > - By design, they happen pretty often so "broker heartbeat timeouts in X > minute" will be a pretty good proxy for load in X minute. The same can't be > said for other operations, which are more intermittent. > > - User-initiated operations like createTopics can have user-defined timeouts, > including very short ones, so such timeouts might just mean the user is > setting very short timeouts. So you can't really alert on that without > introducing false positives. > > I agree that if we want more timeout stats than just "heartbeats / > everything", we should consider a more principled system. But I don't think > we need it here (yet?) and it would expand the scope a lot, so let's not for > now. > > best, > Colin > > > > Ron > > > > On Thu, Jun 1, 2023 at 6:13 PM Colin McCabe <cmcc...@apache.org> wrote: > >> > >> Hi all, > >> > >> I posted a KIP to add some more metrics for measuring KRaft performance. > >> Take a look at: https://cwiki.apache.org/confluence/x/gBU0Dw > >> > >> best, > >> Colin