[jira] [Commented] (KAFKA-18369) State updater's *-ratio metrics are incorrect

Lucas Brutschy (Jira) Thu, 02 Jan 2025 02:13:11 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-18369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17909259#comment-17909259
 ]


Lucas Brutschy commented on KAFKA-18369:
----------------------------------------

Hi [~sumislawski]. I agree that something is off here. However, not sure (yet), 
we have to solve it by reporting different metrics. Note that metrics are part 
of the public interface of Kafka and thus require a KIP to be changed.

Should we report an average here in the first place? If we look at the 
`StreamThread`, we report similar metrics, but do not use an average stat, but 
report the value directly:

https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/metrics/ThreadMetrics.java#L228

If you look at the KIP that introduced these metrics, there is no mention of 
averaging the ratios. 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-869%3A+Improve+Streams+State+Restoration+Visibility

In summary, I agree that this is a bug, but I'd propose to fix it by removing 
the averaging.

> State updater's *-ratio metrics are incorrect
> ---------------------------------------------
>
>                 Key: KAFKA-18369
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18369
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.8.1
>            Reporter: Rafał Sumisławski
>            Priority: Major
>
> h2. Background
> {{DefaultStateUpdater}} defines {{{}idle-ratio{}}}, 
> {{{}active-restore-ratio{}}}, {{{}standby-update-ratio{}}}, 
> {{checkpoint-ratio}} metrics here: 
> [https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/DefaultStateUpdater.java#L1101-L1115]
>  These metrics are averages, that are supposed to indicate "The fraction of 
> time the thread spent on \{action}". But the metrics don't actually do that.
> h2. Issue
> Let me explain this with an example:
> For simplicity's sake, let's consider the following example involving just 
> {{{}standby-update-ratio{}}}, {{checkpoint-ratio}} and ignoring the existence 
> of the other two metrics.
> Let's say the thread did:
>  * {{999}} iterations with {{standby-update}} taking {{1ms}} in each 
> iteration and no {{checkpoint}} happening ({{{}0ms{}}}).
>  * {{1}} iteration with {{standby-update}} taking {{1ms}} and {{checkpoint}} 
> taking {{9000ms}}
> The thread spent {{10s}} working, of which it spent {{1s}} on 
> {{standby-updates}} and {{9s}} on checkpoint, so the fraction of time it 
> spent on checkpoint (checkpoint-ratio) is {{{}~0.001 (0.1%){}}}. Or at least 
> that is what the metrics will say. I would instead argue that it spent 
> {{9s/10s == 0.9 == 90%}} on checkpoint. If you agree with my logic, then you 
> agree that this metrics is incorrect.
> The problem is that the code computes a ratio for each iteration, and then 
> averages those ratios out, producing a number devoid of statistical meaning 
> or practical application. It ignores the fact that the one iteration that 
> took {{{}9s{}}}, should have a much higher weight than those quick 1ms 
> iterations.
> {{(999*(0ms/1ms) + 1*(9000ms/9001ms))/1000 ~= 0.001}}
> h2. Solution
> What we would like to see instead is either a ratio of average/total times, 
> not an average of ratios. I don't think this can be easily realised within 
> the existing metrics system. So instead, what I propose as a solution is to 
> report {{duration-total}} and/or {{duration-rate}} (with the unit of seconds 
> per second) metric for each of {{{}idle{}}}, {{{}active-restore{}}}, 
> {{{}standby-restore{}}}, {{{}checkpoint{}}}. The observers of these metrics, 
> when needed, could then derive the actual ratio of time spent on each 
> operation for example as {{{}checkpoint-ratio = checkpoint-duration-rate / 
> (idle-duration-rate + active-restore-duration-rate + 
> standby-restore-duration-rate + checkpoint-duration-rate){}}}. Or by 
> performing an analogical calculation on deltas of the {{total}} metrics.
> I can submit a PR once there's an agreement on the correct way to fix this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (KAFKA-18369) State updater's *-ratio metrics are incorrect

Reply via email to