[jira] [Assigned] (KAFKA-18615) StreamThread *-ratio metrics suffer from sampling bias

Alieh Saeedi (Jira) Fri, 12 Dec 2025 03:38:07 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-18615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alieh Saeedi reassigned KAFKA-18615:
------------------------------------

    Assignee: Alieh Saeedi

> StreamThread *-ratio metrics suffer from sampling bias
> ------------------------------------------------------
>
>                 Key: KAFKA-18615
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18615
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.8.1
>            Reporter: Rafał Sumisławski
>            Assignee: Alieh Saeedi
>            Priority: Major
>
> h2. Background
> {{StreamThread}} defines the {{commit-ratio}}, {{poll-ratio}}, 
> {{process-ratio}}, and {{punctuate-ratio}} metrics here: 
> [https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/metrics/ThreadMetrics.java#L230-L288]
>  These metrics indicate "The fraction of time the thread spent on \{action}". 
> Unlike DefaultStateUpdater's ratio metrics, these metrics are "value" 
> sensors, meaning that the observable value of the metric is simply the last 
> value recorded before the metric is read. This seems to avoid 
> the ratio averaging issues I described in KAFKA-18369, but...
> h2. Issue
> Let's analyse an example scenario.
> For simplicity I will ignore the existence of {{poll-ratio}} and 
> {{punctuate-ratio}}, and just consider {{commit-ratio}} and 
> {{process-ratio}}.
> Let's say an external observer, be it a human reading JMX metrics, or an 
> automated metric scraping solution, reads the metrics at a random point in 
> time, uncorrelated with the inner workings of the kafka-streams application.
> The application itself works under a steady workload and the stream thread 
> does 1000 iterations every 10 seconds, of which:
>  * 999 iterations execute only {{process}} taking {{1ms}}, resulting in 
> {{commit-ratio=0}}
>  * 1 iteration executes {{process}} taking {{1ms}} and {{commit}} taking 
> {{9000ms}}, resulting in {{commit-ratio=0.9998889012}}
> The iterations occur in no specific order, but we will assume that two 
> committing iterations never happen one after another (the issue still exists 
> without this assumption, the math just gets harder; the assumption is also 
> realistic given how Kafka Streams works).
> The ratio metrics are always updated at the end of an iteration. Therefore 
> the metric values corresponding to iteration number I are visible for the 
> duration of iteration number I+1. In that 10s period, there's only 1ms during 
> which {{commit-ratio=0.9998889012}} can be observed, as 1ms later one of 
> the short iterations completes and overwrites the metric values. During the 
> remaining 9999ms a {{commit-ratio=0}} would be observed. Therefore our random 
> observer has a 99.99% probability of observing {{commit-ratio=0}}, even 
> though the {{StreamThread}} spends 90% of its time on {{commit}}.
> h2. Solution
> This ticket is a sibling of KAFKA-18369. I wanted to report it as a separate 
> ticket as these are different metrics, with currently different 
> implementations, affected by a different problem that needs a separate 
> explanation. But in my opinion the ratio metrics of {{StreamThread}} and 
> {{DefaultStateUpdater}} should work, and be implemented, the same way, so when 
> it comes to a solution I will just refer to the ongoing discussion in 
> KAFKA-18369.
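
The arithmetic in the Issue section can be checked with a small standalone sketch. It does not use any Kafka classes; the class name RatioSamplingBias and its method names are hypothetical, and the durations are the ones assumed in the ticket's example (999 process-only iterations of 1ms, one iteration with 1ms of process plus 9000ms of commit, and the committing iteration always followed by a process-only one).

```java
// Illustrative sketch only: reproduces the numbers from the ticket's example
// scenario. Nothing here is part of Kafka Streams.
final class RatioSamplingBias {
    static final long PROCESS_MS = 1;     // process time per iteration (from the example)
    static final long COMMIT_MS = 9000;   // commit time in the single committing iteration
    static final int ITERATIONS = 1000;   // 999 process-only + 1 process-and-commit

    /** commit-ratio recorded at the end of the committing iteration. */
    public static double lastValueCommitRatio() {
        return (double) COMMIT_MS / (COMMIT_MS + PROCESS_MS);
    }

    /** Wall-clock length of one full cycle of ITERATIONS iterations, in ms. */
    public static long cycleMs() {
        return (ITERATIONS - 1) * PROCESS_MS + (PROCESS_MS + COMMIT_MS);
    }

    /** Fraction of the cycle the thread actually spends in commit. */
    public static double trueCommitShare() {
        return (double) COMMIT_MS / cycleMs();
    }

    /**
     * Probability that an observer sampling at a uniformly random instant
     * reads commit-ratio=0. The non-zero value is visible only during the
     * iteration that follows the committing one, which is assumed to be a
     * process-only iteration lasting PROCESS_MS.
     */
    public static double probabilityOfObservingZero() {
        long nonZeroVisibleMs = PROCESS_MS;
        return (double) (cycleMs() - nonZeroVisibleMs) / cycleMs();
    }

    public static void main(String[] args) {
        System.out.println("cycle length (ms):          " + cycleMs());
        System.out.println("commit-ratio last value:    " + lastValueCommitRatio());
        System.out.println("true share of time in commit: " + trueCommitShare());
        System.out.println("P(observer reads 0):        " + probabilityOfObservingZero());
    }
}
```

The gap between trueCommitShare() and what a random observer is likely to read is the sampling bias described in the ticket: the value is weighted by how long it stays visible, not by how long the iteration that produced it lasted.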



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
