[
https://issues.apache.org/jira/browse/KAFKA-18615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alieh Saeedi reassigned KAFKA-18615:
------------------------------------
Assignee: Alieh Saeedi
> StreamThread *-ratio metrics suffer from sampling bias
> ------------------------------------------------------
>
> Key: KAFKA-18615
> URL: https://issues.apache.org/jira/browse/KAFKA-18615
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 3.8.1
> Reporter: Rafał Sumisławski
> Assignee: Alieh Saeedi
> Priority: Major
>
> h2. Background
> {{StreamThread}} defines the {{commit-ratio}}, {{poll-ratio}},
> {{process-ratio}}, and {{punctuate-ratio}} metrics here:
> [https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/metrics/ThreadMetrics.java#L230-L288]
> These metrics indicate "The fraction of time the thread spent on \{action}".
> Unlike {{DefaultStateUpdater}}'s ratio metrics, these metrics are "value"
> sensors, meaning that the observable value of a metric is simply the last
> value recorded before the metric is read. This seems to avoid
> the ratio-averaging issues I described in KAFKA-18369, but...
> h2. Issue
> Let's analyse an example scenario.
> For simplicity I will ignore the existence of {{poll-ratio}} and
> {{punctuate-ratio}}, and just consider {{commit-ratio}} and
> {{process-ratio}}.
> Let's say an external observer, be it a human reading JMX metrics, or an
> automated metric scraping solution, reads the metrics at a random point in
> time, uncorrelated with the inner workings of the Kafka Streams application.
> The application itself works under a steady workload and the stream thread
> does 1000 iterations every 10 seconds, of which:
> * 999 iterations execute only {{process}} taking {{1ms}}, resulting in
> {{commit-ratio=0}}
> * 1 iteration executes {{process}} taking {{1ms}} and {{commit}} taking
> {{9000ms}}, resulting in {{commit-ratio=0.9998889012}} (9000/9001)
> The iterations happen in no specific order, but we will assume that two
> committing iterations never happen one after another (the issue still exists
> without this assumption; the math just gets harder. The assumption is also
> realistic given how Kafka Streams works).
> The ratio metrics are always updated at the end of an iteration. Therefore
> the metric values corresponding to iteration number I are visible for the
> duration of iteration number I+1. In that 10s period, there's only 1ms during
> which {{commit-ratio=0.9998889012}} can be observed, as 1ms later one of
> the short iterations completes and overwrites the metric values. During the
> remaining 9999ms, {{commit-ratio=0}} would be observed. Therefore our random
> observer has a 99.99% probability of observing {{commit-ratio=0}}, even
> though the {{StreamThread}} spends 90% of its time (9000ms out of every
> 10000ms) on {{commit}}.
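The arithmetic of this scenario can be checked with a small standalone sketch. It does not use any Kafka API; the class {{RatioSamplingBias}} and all of its methods are hypothetical names made up for illustration. It assumes that the 1000-iteration pattern repeats cyclically and that the value recorded at the end of iteration I stays visible for the whole of iteration I+1, as described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the scenario in this ticket; not Kafka code.
public class RatioSamplingBias {

    /** One loop iteration of the thread: time spent in process and in commit, in ms. */
    public static final class Iteration {
        public final long processMs;
        public final long commitMs;

        public Iteration(long processMs, long commitMs) {
            this.processMs = processMs;
            this.commitMs = commitMs;
        }

        public long totalMs() {
            return processMs + commitMs;
        }

        /** The value a "last value" commit-ratio metric would hold after this iteration. */
        public double commitRatio() {
            return (double) commitMs / totalMs();
        }
    }

    /** 1 committing iteration (1ms process + 9000ms commit) followed by 999 short ones. */
    public static List<Iteration> scenario() {
        List<Iteration> iterations = new ArrayList<>();
        iterations.add(new Iteration(1, 9000));
        for (int i = 0; i < 999; i++) {
            iterations.add(new Iteration(1, 0));
        }
        return iterations;
    }

    /** Fraction of wall-clock time the thread actually spends in commit. */
    public static double trueCommitFraction(List<Iteration> iterations) {
        long commitMs = 0;
        long totalMs = 0;
        for (Iteration it : iterations) {
            commitMs += it.commitMs;
            totalMs += it.totalMs();
        }
        return (double) commitMs / totalMs;
    }

    /**
     * Probability that an observer reading the metric at a uniformly random
     * instant sees commit-ratio=0. Assumption: the value recorded by iteration i
     * is visible for the duration of iteration i+1, and the pattern is cyclic.
     */
    public static double probabilityOfObservingZero(List<Iteration> iterations) {
        long zeroVisibleMs = 0;
        long totalMs = 0;
        int n = iterations.size();
        for (int i = 0; i < n; i++) {
            Iteration recorded = iterations.get(i);
            Iteration next = iterations.get((i + 1) % n);
            if (recorded.commitMs == 0) {
                zeroVisibleMs += next.totalMs();
            }
            totalMs += next.totalMs();
        }
        return (double) zeroVisibleMs / totalMs;
    }

    public static void main(String[] args) {
        List<Iteration> iterations = scenario();
        System.out.println("commit-ratio after committing iteration: "
                + iterations.get(0).commitRatio());
        System.out.println("true fraction of time in commit: "
                + trueCommitFraction(iterations));
        System.out.println("probability of observing commit-ratio=0: "
                + probabilityOfObservingZero(iterations));
    }
}
```

The sketch weights each recorded value by the length of the following iteration rather than by the length of the iteration that produced it, which is exactly where the bias comes from: the long committing iteration produces a value that is only visible during a short one.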
> h2. Solution
> This ticket is a sibling of KAFKA-18369. I wanted to report it as a separate
> ticket because these are different metrics, with a currently different
> implementation, affected by a different problem that needs a separate
> explanation. But in my opinion the ratio metrics of {{StreamThread}} and
> {{DefaultStateUpdater}} should work and be implemented the same way, so when
> it comes to a solution I will just refer to the ongoing discussion in
> KAFKA-18369.