Hi Kevin,

Thanks for the reply. Very insightful points.

KW1: Yes, using a single tagged metric makes sense. It's cleaner and more extensible. I'll adopt this approach.

KW2: Yes, we don't need to use `CumulativeCount`. I've already updated the KIP.

KW3: I understand each timer is only meaningful in certain states, but the metric value is still useful for operational monitoring regardless of the current state. It tells you how many times a timeout has expired over the lifetime of the node. Hiding or clearing the metric when the node isn't in the relevant state could actually make it harder for users to diagnose historical issues, since they'd need to catch the metric while the node happens to be in the right state.

For example, if a follower had repeated fetch timeout expirations and then transitioned to candidate/leader, the metric would still be valuable for diagnosing why the leader election happened in the first place, right? If we cleared the metric on state transition, that information would be lost.

The question is: do we want the metric to reflect only the current state, or the overall timeout behavior over the node's lifetime? I lean toward the latter, as it provides more useful information for monitoring network issues. To avoid confusion, maybe we can name the metric `lifetime-timeout-count` and tag it with `timer-name=fetch/election`? What do you think?

On Thu, Apr 30, 2026 at 3:03 PM Kevin Wu <[email protected]> wrote:

> Hi Tony,
>
> Thanks for the KIP. I agree that having metrics for timeouts in KRaft would
> be a nice addition. I have a few high level comments about the KIP:
>
> KW1: Did you consider making a tagged metric like `number-of-timeouts`
> instead of individual metrics? You could tag by the timer name (e.g. fetch,
> election, update-voter, check-quorum, and begin-quorum-epoch etc.) since
> KRaft supports several kinds of timers, and may add more in the future.
> You can look at `NodeMetrics.java` and
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180*3A*Add*generic*feature*level*metrics__;JSsrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQFYUcvow$
> for an example of tagged metrics using Kafka's new metrics library. I think
> there is an argument we should add timeout metrics for some of these other
> KRaft timers I mentioned, since reporting them could also help operators
> diagnose network partitions or possible software bugs.
>
> KW2: I see the "Type" for each metric is `CumulativeCount`. I think this
> might be overkill, and that we could just use Integer for the data type,
> and expose an increment method for each metric. In general, sensors are
> used for when multiple metrics are associated with a specific concept (e.g.
> `commit-latency-avg` and `commit-latency-max` are two different metrics
> associated with the same concept of "commit latency"). It is hard for me to
> imagine that the number of timeouts occurring would have more than one
> metric associated with it.
>
> KW3: Each of these timers is associated with an EpochState (e.g. the fetch
> timer with FollowerState, check quorum timer with LeaderState, etc.). What
> should the value of these metrics be when a node transitions between
> EpochStates? Should we stop reporting the metrics associated with the old
> EpochState, and start reporting the metrics associated with the new
> EpochState? I personally think it might be confusing if these metrics
> report values even if the underlying timer does not exist on the node. For
> example, the fetch timeout metric reporting a value when the local node is
> the KRaft leader seems odd to me. When we added metrics for KIP-853
> associated with the leader (e.g. `uncommitted-voter-change`), we decided to
> only report values for those metrics when the local node was the leader.
> It would be nice if we could follow that convention for these metrics too, and
> document which states report which metrics in the KIP. What do you think?
>
> Best,
> Kevin Wu
>
> On Tue, Apr 21, 2026 at 12:32 PM Tony Tang via dev <[email protected]>
> wrote:
>
> > Hello everyone,
> >
> > I'd like to start a discussion on KIP-1322: Add metrics to Kraft that
> > measure the number of fetch timeouts and election timeouts
> > <https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1322*3A*Add*metrics*to*Kraft*that*measure*the*number*of*fetch*timeouts*and*election*timeouts__;JSsrKysrKysrKysrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQLt1GBmw$>
> >
> > This proposal aims to add new metrics to KRaft that track how often fetch
> > timeouts and election timeouts occur.
> >
> > Best regards,
> > Tony Tang
