[
https://issues.apache.org/jira/browse/HDDS-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903501#comment-17903501
]
Ethan Rose commented on HDDS-10110:
-----------------------------------
To capture a few more ideas here:
* We should explicitly name any metric derived from RocksDB key count estimates
as an estimate.
* Recon should publish metrics about volume/bucket/key counts (if it does not
already) and these should be used as the source of truth for namespace counts.
> Use RocksDB key count estimates instead of OM metrics file
> ----------------------------------------------------------
>
> Key: HDDS-10110
> URL: https://issues.apache.org/jira/browse/HDDS-10110
> Project: Apache Ozone
> Issue Type: Sub-task
> Components: OM
> Reporter: Ethan Rose
> Assignee: Ethan Rose
> Priority: Major
> Labels: pull-request-available
>
> HDDS-816 added a JSON file in the OM to store persisted metrics like key
> count. That Jira has a doc attached that compares some options and decides
> that periodically flushing to a JSON file is the best approach. However, it
> neglects many issues with saving metrics this way:
> * Error handling was missed. See HDDS-10094
> * OMs' metrics can diverge if OMs are restarted at different times between
> flushes of the file.
> * On snapshot install on a follower, the metric will be [reset to estimated
> row|https://github.com/apache/ozone/blob/14e7ff1e6fb2bf11f1df054c63b6e1729e328286/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java#L4006]
> count anyway. This follower's metrics will then have diverged from those of
> the other OMs.
> * When metrics for various OMs diverge, they will show different lines in
> dashboarding applications like Grafana, which may be confusing for users.
> * Restoring the metric to a correct value after bugs like HDDS-10063 requires
> some sort of manual repair.
> * Once metrics diverge between OMs, even a restart will not bring them back
> in sync.
> [HDDS-1829|https://issues.apache.org/jira/browse/HDDS-1829] later added the
> ability for some metrics to be updated based on RocksDB key count estimates.
> See {{Q: How to know the number of keys stored in a RocksDB database?}} in the
> [RocksDB FAQ|https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ]. These
> metrics survive a restart by using the key count estimate and do not use the
> metrics JSON file, so we have two divergent implementations. However, once
> these metrics are set on startup, they are not incremented as new OM
> operations come in.
> This Jira proposes:
> # Get rid of the OM metrics JSON file.
> # Use key count estimates for all metrics that must survive a restart.
> # Continue to update these metrics as OM requests come in.
> While the RocksDB estimated key count will not be totally accurate, the
> JSON-based approach will not be either. The RocksDB approach is easier to
> maintain, both in terms of the code required and in fixing metric counting
> bugs.
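
The three proposed steps could look roughly like the sketch below. This is only
an illustration under stated assumptions, not Ozone's actual metrics code: the
class name {{EstimatedKeyCountMetric}}, its methods, and the {{LongSupplier}}
that stands in for the RocksDB estimate are all hypothetical. In the OM the
supplier would presumably read the {{rocksdb.estimate-num-keys}} property
described in the RocksDB FAQ.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch: a counter that is seeded from a key count estimate and
 * then kept up to date in memory. Nothing is flushed to a file.
 */
final class EstimatedKeyCountMetric {
  // Name should say "estimate" so dashboards do not treat it as exact.
  private final String name;
  // Stand-in for the RocksDB key count estimate (assumption, not an Ozone API).
  private final LongSupplier estimateSource;
  private final AtomicLong value = new AtomicLong();

  EstimatedKeyCountMetric(String name, LongSupplier estimateSource) {
    this.name = name;
    this.estimateSource = estimateSource;
    resetFromEstimate();
  }

  /** Re-seed from the estimate, e.g. on startup or after a snapshot install. */
  void resetFromEstimate() {
    value.set(Math.max(0L, estimateSource.getAsLong()));
  }

  /** Called when an OM request adds an entry. */
  void increment() {
    value.incrementAndGet();
  }

  /** Called when an OM request removes an entry; clamps at zero. */
  void decrement() {
    value.updateAndGet(v -> v > 0 ? v - 1 : 0);
  }

  long value() {
    return value.get();
  }

  String name() {
    return name;
  }
}
```

With this shape, every OM re-seeds from its own DB on restart or snapshot
install, so a diverged or buggy count is corrected by the next reset rather
than by manual repair.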
--
This message was sent by Atlassian Jira
(v8.20.10#820010)