Rafał Sumisławski created KAFKA-18615:
-----------------------------------------
Summary: StreamThread *-ratio metrics suffer from sampling bias
Key: KAFKA-18615
URL: https://issues.apache.org/jira/browse/KAFKA-18615
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 3.8.1
Reporter: Rafał Sumisławski

h2. Background

{{StreamThread}} defines the {{commit-ratio}}, {{poll-ratio}}, {{process-ratio}} and {{punctuate-ratio}} metrics here:
[https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/metrics/ThreadMetrics.java#L230-L288]

These metrics indicate "The fraction of time the thread spent on \{action}". Unlike {{DefaultStateUpdater}}'s ratio metrics, these metrics are "value" sensors, meaning that the observable value of the metric is simply the last value recorded before the metric is read. This seems to avoid the ratio averaging issues I described in KAFKA-18369, but...

h2. Issue

Let's analyse an example scenario. For simplicity I will ignore the existence of {{poll-ratio}} and {{punctuate-ratio}}, and consider only {{commit-ratio}} and {{process-ratio}}.

Let's say an external observer, be it a human reading JMX metrics or an automated metric scraping solution, reads the metrics at a random point in time, uncorrelated with the inner workings of the kafka-streams application.

The application itself works under a steady workload and the stream thread does 1000 iterations every 10 seconds, of which:
* 999 iterations execute only {{process}} taking {{1ms}}, resulting in {{commit-ratio=0}}
* 1 iteration executes {{process}} taking {{1ms}} and {{commit}} taking {{9000ms}}, resulting in {{commit-ratio=0.9998889012}}

The iterations happen in no specific order, but we will assume that two committing iterations never happen one after another (the issue still exists without this assumption, the math just gets harder; the assumption is also realistic given how Kafka Streams works).

The ratio metrics are always updated at the end of an iteration.
Therefore, the metric values corresponding to iteration number I are visible for the duration of iteration number I+1.

In that 10s period, there is only 1ms during which {{commit-ratio=0.9998889012}} can be observed, because 1ms later one of the short iterations completes and overwrites the metric values. During the remaining 9999ms, {{commit-ratio=0}} would be observed.

Therefore our random observer has a 99.99% probability of observing {{commit-ratio=0}}, even though the {{StreamThread}} spends 90% of its time on {{commit}}.

h2. Solution

This ticket is a sibling of KAFKA-18369. I wanted to report it as a separate ticket because these are different metrics, with a currently different implementation, affected by a different problem that needs a separate explanation. But in my opinion the ratio metrics of {{StreamThread}} and {{DefaultStateUpdater}} should work, and be implemented, the same way, so when it comes to a solution I will just refer to the ongoing discussion in KAFKA-18369.
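
The arithmetic of the example scenario in the Issue section can be checked with a small model. This is only a sketch and not Kafka code: the function names and the {{(process_ms, commit_ms)}} representation of an iteration are hypothetical. It assumes the 1000-iteration cycle repeats forever in steady state and that the value recorded by one iteration stays visible for exactly the duration of the next one.

```python
from fractions import Fraction

# Hypothetical model of the example scenario, not Kafka code.
# An iteration is a (process_ms, commit_ms) pair.


def build_cycle(short_iterations=999, process_ms=1, commit_ms=9000):
    """One steady-state cycle: a single committing iteration followed by
    the short, process-only iterations (the order is an assumption)."""
    return [(process_ms, commit_ms)] + [(process_ms, 0)] * short_iterations


def commit_ratio(iteration):
    """The commit-ratio value an iteration records when it finishes."""
    process_ms, commit_ms = iteration
    return Fraction(commit_ms, process_ms + commit_ms)


def observed_distribution(cycle):
    """Probability of each commit-ratio value for an observer who reads
    the metric at a uniformly random instant. The value recorded by
    iteration i is visible while iteration i+1 runs (wrapping around,
    because the cycle is assumed to repeat)."""
    total_ms = sum(p + c for p, c in cycle)
    distribution = {}
    for i, iteration in enumerate(cycle):
        next_iteration = cycle[(i + 1) % len(cycle)]
        visible_ms = sum(next_iteration)
        value = commit_ratio(iteration)
        share = Fraction(visible_ms, total_ms)
        distribution[value] = distribution.get(value, Fraction(0)) + share
    return distribution


def true_commit_fraction(cycle):
    """Fraction of wall-clock time the thread actually spends committing."""
    total_ms = sum(p + c for p, c in cycle)
    return Fraction(sum(c for _, c in cycle), total_ms)


if __name__ == "__main__":
    cycle = build_cycle()
    for value, probability in sorted(observed_distribution(cycle).items()):
        print(f"commit-ratio={float(value):.10f} "
              f"observed with probability {float(probability):.4%}")
    print(f"true commit fraction: {float(true_commit_fraction(cycle)):.2%}")
```

Under these assumptions the model gives a probability of 9999/10000 of reading {{commit-ratio=0}}, while the true commit fraction is 9/10, matching the figures above. Exact fractions are used instead of floats so that the comparison is not blurred by rounding.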