[
https://issues.apache.org/jira/browse/HDDS-12682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17938282#comment-17938282
]
Peter Lee edited comment on HDDS-12682 at 3/27/25 8:19 AM:
-----------------------------------------------------------
[~swamirishi] I recall you mentioning that RocksDB 9.x includes something
similar to this proposal that can be configured with a few options. However, I
couldn’t find anything related in the RocksDB release notes. Do you recall a
specific keyword that might help me track it down? I noticed that RocksDB
has a feature called `CompactOnDeletionCollector` (introduced in version 4.x).
Could this be what you were referring to?
I also observed that TiKV continues to maintain [manual compaction on
ranges|#L30-L52] and is iterating on their
[strategy|https://github.com/tikv/tikv/commits/master/components/raftstore/src/store/worker/compact.rs].
This suggests that range-based manual compaction might still be worth
exploring, especially with a comprehensive benchmark.
I’d love to hear your thoughts on this. Thank you!
> Aggressive DB Compaction with Minimal Degradation
> -------------------------------------------------
>
> Key: HDDS-12682
> URL: https://issues.apache.org/jira/browse/HDDS-12682
> Project: Apache Ozone
> Issue Type: Improvement
> Components: db, OM
> Reporter: Peter Lee
> Assignee: Peter Lee
> Priority: Major
>
> After researching TiKV and RocksDB compaction, I have some thoughts on the
> current OM compaction:
> 1. TiKV also runs a background task to perform compaction.
> 2. If we directly compact an entire column family (cf), it seems it would
> impact online performance (write amplification).
> 3. We can use the built-in `TableProperties` in SST to check the
> `num_entries` and `num_deletions` of an SST file, but these two metrics only
> count operations and don’t deduplicate keys.
> 4. TiKV has implemented a custom `MVCTablePropertiesCollector`, which
> includes deduplication for more accurate results. However, the Java API
> currently doesn’t seem to support custom `TablePropertiesCollector` 💩, so
> we’re forced to make do with the built-in statistical data.
> 5. TiKV logically splits key ranges (table regions, with a default size limit
> of 256 MB per region), allowing it to gradually scan and compact known ranges.
> - [Compaction key range
> paging](https://github.com/tikv/tikv/pull/2631/files#diff-49d2597226cac1291163478f47bee5d4530bd4b9b84d322059e8afaf7dd3dedcR1896-R1938)
> - [Check if each key range needs
> compaction](https://github.com/tikv/tikv/pull/2631/files#diff-52d5655c2ce5a05afae67d216f55e98a1d71c971e1869628b7ebe387dda90a37R203-R217)
>
> If we want to apply this logic to the Ozone Manager (OM), note that OM isn’t a
> distributed KV store, so it doesn’t have the concept of key ranges. The only
> logical key range division we can use is the bucket prefix (file table).
> For FSO buckets, however, we can go one step further and divide key ranges
> based on the directory `parent_id`.
>
> So, I think for `KeyTable` compaction, we could compact each bucket
> individually and keep track of a `next_bucket` for paging. For
> directory-related tables, we could also compact each bucket, but if a bucket
> turns out to be too large, we could further compact based on the ordered
> `parent_id` key ranges. This would require two paging keys: `next_bucket` and
> `next_parent_id`.
> - [TableProperties
> class](https://github.com/facebook/rocksdb/blob/main/java/src/main/java/org/rocksdb/TableProperties.java#L12)
> - [`public Map<String, TableProperties> getPropertiesOfTablesInRange(final
> ColumnFamilyHandle columnFamilyHandle, final List<Range>
> ranges)`](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/RocksDB.java#L4575)
> - [Range
> Class](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/Range.java)
> The Java APIs currently provide some support that would allow us to implement
> the above ideas.
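
A minimal Java sketch of the two ideas in the description above: deciding from `num_entries` and `num_deletions` whether a key range is worth compacting, and paging through bucket prefixes with a `next_bucket` cursor. Everything here is hypothetical and for illustration only: the class names, the threshold parameters, and the bucket prefix strings are assumptions, not existing Ozone or RocksDB APIs. The counters are assumed to have been summed beforehand from the `TableProperties` of the SST files covering the range. A `next_parent_id` cursor for large FSO buckets would be a second level of the same pattern and is left out.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical sketch; none of these names exist in Ozone or RocksDB. */
final class RangeCompactionSketch {

  /** Counters summed over the SST files that overlap one key range. */
  static final class RangeStats {
    final long numEntries;
    final long numDeletions;

    RangeStats(long numEntries, long numDeletions) {
      this.numEntries = numEntries;
      this.numDeletions = numDeletions;
    }
  }

  /**
   * Returns true when the range holds at least minTombstones deletions and
   * deletions make up at least minRatio of all entries. Both thresholds are
   * placeholders to be tuned by benchmarking. The counters are per operation,
   * not per distinct key, so this is only an estimate.
   */
  static boolean needsCompaction(RangeStats stats, long minTombstones, double minRatio) {
    if (stats.numEntries == 0 || stats.numDeletions < minTombstones) {
      return false;
    }
    return (double) stats.numDeletions / stats.numEntries >= minRatio;
  }

  /** Pages through sorted bucket prefixes, remembering where to resume. */
  static final class BucketPager {
    private final NavigableSet<String> buckets;
    private String nextBucket; // null means "start from the first bucket"

    BucketPager(Collection<String> bucketPrefixes) {
      this.buckets = new TreeSet<>(bucketPrefixes);
    }

    /** Returns up to pageSize buckets and advances the cursor, wrapping at the end. */
    List<String> nextPage(int pageSize) {
      if (pageSize <= 0) {
        throw new IllegalArgumentException("pageSize must be positive");
      }
      NavigableSet<String> remaining =
          nextBucket == null ? buckets : buckets.tailSet(nextBucket, true);
      List<String> page = new ArrayList<>();
      for (String bucket : remaining) {
        if (page.size() == pageSize) {
          nextBucket = bucket; // resume here on the next call
          return page;
        }
        page.add(bucket);
      }
      nextBucket = null; // reached the end; the next call starts over
      return page;
    }
  }
}
```

A background task could call `nextPage` once per run, gather the SST counters for each returned bucket prefix, and trigger a range compaction only where `needsCompaction` returns true. With three buckets and a page size of 2, the pager returns the first two buckets, then the third, then wraps around to the first two again.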