Hi Kevin, Thanks for the reply.
KW3: Ok, I agree that metric values at time X should be an accurate snapshot of the node at time X, and that collecting historical values is not the responsibility of kafka. The approach you described makes sense to me: we keep the internal counter across state transitions, but only register/unregister the metric based on whether the underlying timer exists. I'm on board with that approach KW4: That's a great suggestion. However, I think it's out of scope for this KIP. I'd prefer to keep this KIP focused on timeout expiration metrics and maybe we can discuss the state transition counting in the future. Best, Tony On Fri, May 1, 2026 at 5:12 PM Kevin Wu <[email protected]> wrote: > Hi Tony, > > Thanks for the reply. > > KW3: I don't think "count" metrics like the ones being discussed in this > KIP should report values when the objects they are associated with do not > exist. This would mean metric values at time X are not an accurate > "snapshot" of the Kafka node at time X. In my opinion, collecting the > historic values for a metric, visualizing them through dashboards, and > monitoring them to alert operators are the responsibilities of a downstream > observability software, not Kafka. Kafka does have the capability to create > "derivative" metrics (i.e. metrics whose values are based off of sampling > something else over time) via Sensors, but I don't think that fits our use > case as previously discussed. > > Another way to think about it is that adding or removing a metric so Kafka > starts or stops reporting a value to an observability service actually > tells the user more information about Kafka compared to unconditionally > reporting said metric value. Additionally, just because you remove a metric > does not mean you need to remove the corresponding counter within the > KafkaRaftMetrics object. For example, a node fetch request times out 5 > times as a follower, so the node reports 5 for the metric. Next, the node > becomes the leader, so it stops reporting the metric (i.e. the metric has > no value). Then it becomes a follower, and starts reporting 5 again. When > the next fetch timeout happens, the node reports 6 for the metric. What do > you think about this behavior? > > KW4: If the desire is for a metric that always reports a value, what do you > think about a metric that counts the number of `EpochState` transitions? I > think this value makes sense to report this value for the lifetime of a > process, and generally, frequent state transitions are an indication > something is wrong with the cluster. This would be an additional metric to > the ones we discussed above. > > Best, > Kevin Wu > > On Fri, May 1, 2026 at 12:59 PM Tony Tang via dev <[email protected]> > wrote: > > > Hi Kevin, > > > > Thanks for the reply. Very insightful points. > > > > KW1: Yes, using a single tagged metric makes sense. it's cleaner and more > > extensible. I'll adopt this approach. > > > > KW2: Yes, we don't need to use `CumulativeCount`. Already updated in the > > KIP > > > > KW3: I understand each timer is only meaningful in certain states, but > the > > metric value is still useful for operational monitoring regardless of the > > current state. It tells you how many times a timeout has expired over the > > lifetime of the node. Hiding or clearing the metric when the node isn't > in > > the relevant state could actually make it harder for users to diagnose > > historical issues, since they'd need to catch the metric while the node > > happens to be in the right state. For example, if a follower had repeated > > fetch timeout expirations and then transitions to a candidate/leader, the > > metrics would still be valuable for diagnosing why the leader election > > happened in the first place, right? If we cleared the metric on state > > transition, that information would be lost. The question is : Do we only > > want the metric to reflect only the latest state, or the overall timeout > > behavior over the node's lifetime? I lean toward the latter, as it > provides > > more useful information for monitoring network issues. To avoid > confusion, > > maybe we can use the metric name lifetime-timeout-count + tag > > timer-name=fetch/election? What do you think? > > > > > > > > > > On Thu, Apr 30, 2026 at 3:03 PM Kevin Wu <[email protected]> wrote: > > > > > Hi Tony, > > > > > > Thanks for the KIP. I agree that having metrics for timeouts in KRaft > > would > > > be a nice addition. I have a few high level comments about the KIP: > > > > > > KW1: Did you consider making a tagged metric like `number-of-timeouts` > > > instead of individual metrics? You could tag by the timer name (e.g. > > fetch, > > > election, update-voter, check-quorum, and begin-quorum-epoch etc.) > since > > > KRaft supports several kinds of timers, and may add more in the future. > > You > > > can look at `NodeMetrics.java` and > > > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1180*3A*Add*generic*feature*level*metrics__;JSsrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQFYUcvow$ > > > for an example of tagged metrics using Kafka's new metrics library. I > > think > > > there is an argument we should add timeout metrics for some of these > > other > > > KRaft timers I mentioned, since reporting them could also help > operators > > > diagnose network partitions or possible software bugs. > > > > > > KW2: I see the "Type" for each metric is `CumulativeCount`. I think > this > > > might be overkill, and that we could just use Integer for the data > type, > > > and expose an increment method for each metric. In general, sensors are > > > used for when multiple metrics are associated with a specific concept > > (e.g. > > > `commit-latency-avg` and `commit-latency-max` are two different metrics > > > associated with the same concept of "commit latency"). It is hard for > me > > to > > > imagine that the number of timeouts occurring would have more than one > > > metric associated with it. > > > > > > KW3: Each of these timers is associated with an EpochState (e.g. the > > fetch > > > timer with FollowerState, check quorum timer with LeaderState, etc.). > > What > > > should the value of these metrics be when a node transitions between > > > EpochStates? Should we stop reporting the metrics associated with the > old > > > EpochState, and start reporting the metrics associated with the new > > > EpochState? I personally think it might be confusing if these metrics > > > report values even if the underlying timer does not exist on the node. > > For > > > example, the fetch timeout metric reporting a value when the local node > > is > > > the KRaft leader seems odd to me. When we added metrics for KIP-853 > > > associated with the leader (e.g. `uncommitted-voter-change`), we > decided > > to > > > only report values for those metrics when the local node was the > leader. > > It > > > would be nice if we could follow that convention for these metrics too, > > and > > > document which states report which metrics in the KIP. What do you > think? > > > > > > Best, > > > Kevin Wu > > > > > > On Tue, Apr 21, 2026 at 12:32 PM Tony Tang via dev < > [email protected] > > > > > > wrote: > > > > > > > Hello everyone, > > > > > > > > I'd like to start a discussion on KIP-1322: Add metrics to Kraft that > > > > measure the number of fetch timeouts and election timeouts < > > > > > > > > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1322*3A*Add*metrics*to*Kraft*that*measure*the*number*of*fetch*timeouts*and*election*timeouts__;JSsrKysrKysrKysrKysr!!Ayb5sqE7!qPjsZ_186iR3QjEak9hmexMYOhwzDGzvcLYwnVUujYlAy2wAAQchfvSKMr9oG7Mygg608Vz6zFCv5QDQLt1GBmw$ > > > > > > > > > > > > > This proposal aims to add new metrics to KRaft that track how often > > fetch > > > > timeouts and election timeouts occur. > > > > > > > > Best regards, > > > > Tony Tang > > > > > > > > > >
