George Wu created KAFKA-19484:
---------------------------------
Summary: Tiered Storage Quota Metrics can stop reporting
Key: KAFKA-19484
URL: https://issues.apache.org/jira/browse/KAFKA-19484
Project: Kafka
Issue Type: Bug
Components: Tiered-Storage
Affects Versions: 4.0.0, 3.9.0
Environment: Ubuntu 22, Amazon Corretto Java 17
Reporter: George Wu
It is possible for tiered storage throttle metrics (introduced as a part of
[KIP-956|https://cwiki.apache.org/confluence/display/KAFKA/KIP-956+Tiered+Storage+Quotas])
to stop reporting if the relevant tiered storage operation (copy/fetch) goes
idle for longer than the sensor expiry timeout of one hour.
RemoteLogManager maintains a static reference to the sensors used for metric
reporting. This is a problem because the default sensor expiry time is one hour
and there is nothing responsible for handling expired sensors. If the sensors
expire, RemoteLogManager will continue producing metrics through its static
references to sensor objects that have already been cleaned up by the
ExpireSensorTask.
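The failure mode can be illustrated with a small, self-contained model. This is only a sketch: ToySensor and ToyRegistry are hypothetical stand-ins, not Kafka's Metrics/Sensor classes, and the sensor name "remote-fetch-throttle-time" is assumed for illustration. It shows that recording through a cached reference still succeeds after the registry has dropped the sensor, but the reporter can no longer see it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy model of the stale-reference problem. All classes here are
 * hypothetical stand-ins, not Kafka's actual implementation.
 */
class StaleSensorDemo {

    /** A sensor that remembers when it was last recorded to. */
    static final class ToySensor {
        final String name;
        final long expiryMs;
        long lastRecordMs;
        double lastValue = Double.NaN;

        ToySensor(String name, long expiryMs, long nowMs) {
            this.name = name;
            this.expiryMs = expiryMs;
            this.lastRecordMs = nowMs;
        }

        void record(double value, long nowMs) {
            lastValue = value;
            lastRecordMs = nowMs;
        }

        boolean hasExpired(long nowMs) {
            return nowMs - lastRecordMs > expiryMs;
        }
    }

    /** A registry whose reporter only sees sensors that are still registered. */
    static final class ToyRegistry {
        private final Map<String, ToySensor> sensors = new ConcurrentHashMap<>();

        ToySensor sensor(String name, long expiryMs, long nowMs) {
            return sensors.computeIfAbsent(name, n -> new ToySensor(n, expiryMs, nowMs));
        }

        /** Plays the role of a periodic expiry task. */
        void expireIdleSensors(long nowMs) {
            sensors.values().removeIf(s -> s.hasExpired(nowMs));
        }

        boolean isReported(String name) {
            return sensors.containsKey(name);
        }
    }

    /** Returns whether the metric is still visible after an idle period. */
    static boolean reportedAfterIdle(long idleMs) {
        final long oneHourMs = 60L * 60L * 1000L;
        final String name = "remote-fetch-throttle-time"; // assumed name
        ToyRegistry registry = new ToyRegistry();

        // The reference is obtained once and kept, like a static field.
        ToySensor cached = registry.sensor(name, oneHourMs, 0L);
        cached.record(5.0, 0L);

        // The operation goes idle, then the expiry task runs.
        registry.expireIdleSensors(idleMs);

        // Recording through the cached reference does not fail,
        // but it does not re-register the sensor either.
        cached.record(7.0, idleMs);
        return registry.isReported(name);
    }

    public static void main(String[] args) {
        System.out.println(reportedAfterIdle(30L * 60L * 1000L));      // idle 30 minutes
        System.out.println(reportedAfterIdle(2L * 60L * 60L * 1000L)); // idle 2 hours
    }
}
```

In this model the 30 minute case keeps the sensor registered and the 2 hour case does not, which matches the behaviour described above.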
This issue tends to affect fetch metrics far more than copy metrics because
the copy sensors don't go idle unless the topics stop being produced to. In
contrast, fetch traffic is often intermittent: backfilling from the earliest
offset using tiered storage is a common use case.
*Reproduction*
* Generate some amount of tiered storage fetch traffic on a topic. Confirm the
remote-fetch-throttle-time-avg/max metrics are being reported.
* Remove the consumer workload that triggers the tiered storage fetch traffic.
Wait for more than one hour (the sensor expiration period).
* Generate some more tiered storage fetch traffic. The metrics are no longer
reported.
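The three steps can be walked through on a toy model, which also contrasts the cached-reference behaviour with one possible direction for handling expiry: resolving the sensor by name on every recording, in a get-or-create style. This is a hedged sketch only. SensorLookupSketch, its methods, and the sensor name are hypothetical, and it is not meant to represent how RemoteLogManager is or should be implemented.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical toy model of the reproduction; not Kafka's Metrics API. */
class SensorLookupSketch {
    static final long EXPIRY_MS = 60L * 60L * 1000L; // one hour, per the report

    // Sensor name -> timestamp of the last recording.
    // A missing entry means the metric is no longer reported.
    private final Map<String, Long> registered = new HashMap<>();

    /** Get-or-create style recording: re-registers an expired sensor. */
    void recordByName(String name, long nowMs) {
        registered.put(name, nowMs);
    }

    /** Recording through a stale handle: only visible if still registered. */
    void recordViaCachedHandle(String name, long nowMs) {
        registered.computeIfPresent(name, (n, last) -> nowMs);
    }

    /** Plays the role of the periodic expiry task. */
    void expire(long nowMs) {
        registered.values().removeIf(last -> nowMs - last > EXPIRY_MS);
    }

    boolean isReported(String name) {
        return registered.containsKey(name);
    }

    /** Runs the three reproduction steps and returns the final visibility. */
    static boolean reproduce(boolean lookupPerRecord) {
        SensorLookupSketch metrics = new SensorLookupSketch();
        String name = "remote-fetch-throttle-time"; // assumed name

        // Step 1: initial fetch traffic; the metric is reported.
        metrics.recordByName(name, 0L);

        // Step 2: the workload stops and the idle period exceeds the expiry.
        long later = 2 * EXPIRY_MS;
        metrics.expire(later);

        // Step 3: fetch traffic resumes.
        if (lookupPerRecord) {
            metrics.recordByName(name, later);
        } else {
            metrics.recordViaCachedHandle(name, later);
        }
        return metrics.isReported(name);
    }

    public static void main(String[] args) {
        System.out.println("cached handle:     " + reproduce(false));
        System.out.println("lookup per record: " + reproduce(true));
    }
}
```

In this model the cached-handle path leaves the metric unreported after step 3, while the lookup-per-record path brings it back. Whether a per-record lookup is acceptable on a real hot path is a separate question that the sketch does not address.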
--
This message was sent by Atlassian Jira
(v8.20.10#820010)