[ 
https://issues.apache.org/jira/browse/HDDS-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903501#comment-17903501
 ] 

Ethan Rose commented on HDDS-10110:
-----------------------------------

To capture a few more ideas here:
* We should explicitly name any metric derived from RocksDB key count estimates 
as an estimate.
* Recon should publish metrics for volume, bucket, and key counts (if it does 
not already), and these should be used as the source of truth for namespace counts.
 

> Use RocksDB key count estimates instead of OM metrics file
> ----------------------------------------------------------
>
>                 Key: HDDS-10110
>                 URL: https://issues.apache.org/jira/browse/HDDS-10110
>             Project: Apache Ozone
>          Issue Type: Sub-task
>          Components: OM
>            Reporter: Ethan Rose
>            Assignee: Ethan Rose
>            Priority: Major
>              Labels: pull-request-available
>
> HDDS-816 added a JSON file in the OM to store persisted metrics like key 
> count. The Jira has a doc attached that compares some options and decides 
> that periodically flushing to a JSON file is the best approach. However, it 
> overlooks several issues with saving metrics this way:
> * Error handling was missed. See HDDS-10094
> * OMs' metrics can diverge if OMs are restarted at different times between 
> flushes of the file.
> * On snapshot install on a follower, the metric will be [reset to estimated 
> row|https://github.com/apache/ozone/blob/14e7ff1e6fb2bf11f1df054c63b6e1729e328286/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java#L4006]
>  count anyway. This follower's metrics will then have diverged from those of 
> the other OMs.
> * When metrics for various OMs diverge, they will show different lines in 
> dashboarding applications like Grafana, which may be confusing for users.
> * Restoring the metric to a correct value after bugs like HDDS-10063 requires 
> some sort of manual repair.
> * Once metrics diverge between OMs, even a restart will not bring them back 
> in sync.
>
> [HDDS-1829|https://issues.apache.org/jira/browse/HDDS-1829] later added the 
> ability for some metrics to be updated based on RocksDB key count estimates. 
> See {{Q: How to know the number of keys stored in a RocksDB database?}} in the 
> [RocksDB FAQ|https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ]. These 
> metrics survive a restart using the key count estimate and do not use the 
> metrics JSON file, so we have two divergent implementations. However, once 
> these metrics are updated on startup, they are not incremented as new OM 
> operations come in.
>
> This Jira proposes:
> # Get rid of the OM metrics JSON file.
> # Use key count estimates for all metrics that must survive a restart.
> # Continue to update these metrics as OM requests come in.
>
> While the RocksDB estimated key count will not be totally accurate, the 
> JSON-based approach will not be either. The RocksDB approach is easier to 
> maintain, both in the amount of code required and in fixing metric-counting bugs.
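
The three proposed steps above could look roughly like the following. This is a minimal sketch, not the actual Ozone code: the class name {{EstimatedKeyCountMetric}} and its methods are hypothetical, and the estimate is passed in as a plain {{LongSupplier}} so the example runs without a RocksDB dependency. In a real OM, that supplier would presumably read the {{rocksdb.estimate-num-keys}} property that the RocksDB FAQ describes.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical metric holder (not the real Ozone metrics class).
 * The value is seeded from a key count estimate instead of a persisted file,
 * then kept up to date as requests are applied.
 */
final class EstimatedKeyCountMetric {
    // Assumption: supplies the table's estimated key count, for example from
    // the "rocksdb.estimate-num-keys" property of the backing column family.
    private final LongSupplier estimator;

    // Named as an estimate so dashboards do not treat it as an exact count.
    private final AtomicLong numKeysEstimate = new AtomicLong();

    EstimatedKeyCountMetric(LongSupplier estimator) {
        this.estimator = estimator;
        reseed(); // on startup: no JSON file is read
    }

    /** Re-read the estimate, e.g. on startup or after a snapshot install. */
    void reseed() {
        numKeysEstimate.set(Math.max(0L, estimator.getAsLong()));
    }

    /** Called when a key creation request is applied. */
    void onKeyCreated() {
        numKeysEstimate.incrementAndGet();
    }

    /** Called when a key deletion request is applied; never goes below zero. */
    void onKeyDeleted() {
        numKeysEstimate.updateAndGet(v -> Math.max(0L, v - 1));
    }

    long get() {
        return numKeysEstimate.get();
    }
}
```

Because every OM reseeds from its own DB estimate on restart and on snapshot install, a diverged value is corrected by the next reseed rather than by manual repair.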



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

