[ https://issues.apache.org/jira/browse/KAFKA-18369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17915826#comment-17915826 ]
Matthias J. Sax commented on KAFKA-18369:
-----------------------------------------

Did not think about it in detail... Guess it's up to whoever picks up this ticket to figure it out and make a proposal. :) But yes, something like `WindowedAvg`, I guess (which would internally re-use `WindowedSum`)?

> State updater's *-ratio metrics are incorrect
> ---------------------------------------------
>
> Key: KAFKA-18369
> URL: https://issues.apache.org/jira/browse/KAFKA-18369
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 3.8.1
> Reporter: Rafał Sumisławski
> Priority: Major
>
> h2. Background
> {{DefaultStateUpdater}} defines the {{idle-ratio}}, {{active-restore-ratio}}, {{standby-update-ratio}}, and {{checkpoint-ratio}} metrics here:
> [https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java#L1101-L1115]
> These metrics are averages that are supposed to indicate "The fraction of time the thread spent on \{action}". But the metrics don't actually do that.
> h2. Issue
> Let me explain this with a simplified example involving just {{standby-update-ratio}} and {{checkpoint-ratio}}, ignoring the existence of the other two metrics.
> Let's say the thread did:
> * {{999}} iterations with {{standby-update}} taking {{1ms}} in each iteration and no {{checkpoint}} happening ({{0ms}}).
> * {{1}} iteration with {{standby-update}} taking {{1ms}} and {{checkpoint}} taking {{9000ms}}.
> The thread spent {{10s}} working, of which it spent {{1s}} on {{standby-update}} and {{9s}} on {{checkpoint}}, so the fraction of time it spent on checkpoint ({{checkpoint-ratio}}) is {{~0.001 (0.1%)}}. Or at least that is what the metrics will say. I would instead argue that it spent {{9s/10s == 0.9 == 90%}} on checkpoint. If you agree with my logic, then you agree that this metric is incorrect.
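
The arithmetic in the example above can be reproduced with a small standalone sketch. This is not code from `DefaultStateUpdater`; the class and method names (`CheckpointRatioExample`, `averageOfRatios`, `ratioOfTotals`, `exampleWorkload`) are hypothetical and exist only to contrast an average of per-iteration ratios with a ratio of total times for the workload described in the ticket.

```java
// Illustrative only: hypothetical names, not part of Kafka Streams.
class CheckpointRatioExample {

    /** Mean of per-iteration checkpoint ratios: every iteration gets equal weight. */
    static double averageOfRatios(long[] standbyMs, long[] checkpointMs) {
        double sum = 0.0;
        for (int i = 0; i < standbyMs.length; i++) {
            long iterationTotal = standbyMs[i] + checkpointMs[i];
            // An iteration that took no time contributes a ratio of 0.
            sum += iterationTotal == 0 ? 0.0 : (double) checkpointMs[i] / iterationTotal;
        }
        return sum / standbyMs.length;
    }

    /** Total checkpoint time divided by total time: long iterations weigh more. */
    static double ratioOfTotals(long[] standbyMs, long[] checkpointMs) {
        long standbyTotal = 0;
        long checkpointTotal = 0;
        for (int i = 0; i < standbyMs.length; i++) {
            standbyTotal += standbyMs[i];
            checkpointTotal += checkpointMs[i];
        }
        return (double) checkpointTotal / (standbyTotal + checkpointTotal);
    }

    /** The workload from the ticket: 999 iterations of 1ms/0ms, then one of 1ms/9000ms. */
    static long[][] exampleWorkload() {
        long[] standbyMs = new long[1000];
        long[] checkpointMs = new long[1000];
        java.util.Arrays.fill(standbyMs, 1L);   // standby-update takes 1ms in every iteration
        checkpointMs[999] = 9000L;              // a single checkpoint takes 9000ms
        return new long[][] {standbyMs, checkpointMs};
    }

    public static void main(String[] args) {
        long[][] workload = exampleWorkload();
        System.out.println("average of ratios: " + averageOfRatios(workload[0], workload[1]));
        System.out.println("ratio of totals:   " + ratioOfTotals(workload[0], workload[1]));
    }
}
```

With this workload the first method yields roughly 0.001 and the second yields 0.9, matching the two figures quoted in the description.
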
> The problem is that the code computes a ratio for each iteration and then averages those ratios, producing a number devoid of statistical meaning or practical application. It ignores the fact that the one iteration that took {{9s}} should have a much higher weight than the quick {{1ms}} iterations.
> {{(999*(0ms/1ms) + 1*(9000ms/9001ms))/1000 ~= 0.001}}
> h2. Solution
> What we would like to see instead is a ratio of average/total times, not an average of ratios. I don't think this can be easily realised within the existing metrics system. So instead, what I propose as a solution is to report a {{duration-total}} and/or {{duration-rate}} (with the unit of seconds per second) metric for each of {{idle}}, {{active-restore}}, {{standby-restore}}, {{checkpoint}}. The observers of these metrics could then, when needed, derive the actual ratio of time spent on each operation, for example as {{checkpoint-ratio = checkpoint-duration-rate / (idle-duration-rate + active-restore-duration-rate + standby-restore-duration-rate + checkpoint-duration-rate)}}, or by performing an analogous calculation on deltas of the {{total}} metrics.
> I can submit a PR once there's an agreement on the correct way to fix this.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)