Peter Lee created HDDS-12682:
--------------------------------
Summary: Aggressive DB Compaction with Minimal Degradation
Key: HDDS-12682
URL: https://issues.apache.org/jira/browse/HDDS-12682
Project: Apache Ozone
Issue Type: Improvement
Components: db, OM
Reporter: Peter Lee
Assignee: Peter Lee
After researching TiKV and RocksDB compaction, I have some thoughts on the
current OM compaction:
1. TiKV also runs a background task to perform compaction.
2. If we directly compact an entire column family (CF), it seems it would
hurt online performance through write amplification.
3. We can use the built-in `TableProperties` of an SST file to check its
`num_entries` and `num_deletions`, but these two counters only count
operations and don’t deduplicate keys (a check built on them is sketched
right after this list).
4. TiKV has implemented a custom `MVCTablePropertiesCollector`, which includes
deduplication for more accurate results. However, the Java API currently
doesn’t seem to support custom `TablePropertiesCollector` 💩, so we’re forced to
make do with the built-in statistical data.
5. TiKV logically splits key ranges (table regions, with a default size limit
of 256 MB per region), allowing it to gradually scan and compact known ranges.
- [Compaction key range
paging](https://github.com/tikv/tikv/pull/2631/files#diff-49d2597226cac1291163478f47bee5d4530bd4b9b84d322059e8afaf7dd3dedcR1896-R1938)
- [Check if each key range needs
compaction](https://github.com/tikv/tikv/pull/2631/files#diff-52d5655c2ce5a05afae67d216f55e98a1d71c971e1869628b7ebe387dda90a37R203-R217)
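To make point 3 concrete, here is a minimal sketch of a "does this key
range need compaction" check built only on the built-in statistics,
assuming the RocksDB Java APIs linked at the bottom of this issue; the
method name and the `tombstoneRatio`/`minEntries` thresholds are
hypothetical, not existing Ozone code:

```java
import java.util.List;
import java.util.Map;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.Range;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.Slice;
import org.rocksdb.TableProperties;

public final class CompactionCheck {

  /**
   * Returns true if the SST files overlapping [start, limit) carry enough
   * tombstones to make a range compaction look worthwhile. num_entries and
   * num_deletions count operations, not distinct keys, so the ratio is
   * only an upper-bound hint, as noted in point 3.
   */
  static boolean needsCompaction(RocksDB db, ColumnFamilyHandle cf,
      byte[] start, byte[] limit, double tombstoneRatio, long minEntries)
      throws RocksDBException {
    try (Slice begin = new Slice(start); Slice end = new Slice(limit)) {
      Map<String, TableProperties> props =
          db.getPropertiesOfTablesInRange(cf,
              List.of(new Range(begin, end)));
      long entries = 0;
      long deletions = 0;
      for (TableProperties p : props.values()) {
        entries += p.getNumEntries();
        deletions += p.getNumDeletions();
      }
      return entries >= minEntries
          && deletions >= (long) (entries * tombstoneRatio);
    }
  }
}
```

Because the counters are per-operation, a range that passes this check may
still be dominated by duplicate puts rather than tombstones; that is
exactly the imprecision TiKV’s custom collector avoids (point 4).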
If we want to apply this logic to the Ozone Manager (OM): since OM isn’t a
distributed KV store, it has no native concept of key ranges. The only
logical key-range division available is the bucket prefix (file table).
Even better, for FSO buckets we can subdivide each bucket’s range further
by the directory `parent_id`.
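For illustration, assuming the FSO file/directory table key layout
`/<volumeId>/<bucketId>/<parentId>/<name>` used by OM, a parent-scoped
range is just a longer prefix (the helper below is hypothetical, not
existing OM code):

```java
import java.nio.charset.StandardCharsets;

/** Range prefixes for the FSO tables, assuming the
 *  /<volumeId>/<bucketId>/<parentId>/<name> key layout. */
final class FsoRanges {

  /** Prefix covering every key in one bucket. */
  static byte[] bucketPrefix(long volumeId, long bucketId) {
    return ("/" + volumeId + "/" + bucketId + "/")
        .getBytes(StandardCharsets.UTF_8);
  }

  /** Narrower prefix covering one directory's children; the extra
   *  parentId component splits a too-large bucket range into many
   *  small ones that can be paged with a next_parent_id cursor. */
  static byte[] parentPrefix(long volumeId, long bucketId, long parentId) {
    return ("/" + volumeId + "/" + bucketId + "/" + parentId + "/")
        .getBytes(StandardCharsets.UTF_8);
  }
}
```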
So, for `KeyTable` compaction, I think we could compact each bucket
individually and keep a `next_bucket` cursor for paging (sketched below).
For the directory-related tables we could likewise compact bucket by
bucket, but if a bucket turns out to be too large, we could further split
it along the ordered `parent_id` key ranges. This requires two paging
cursors: `next_bucket` and `next_parent_id`.
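Below is a sketch of the paging loop for the simple per-bucket case.
`BucketPagingCompactor`, the way the sorted bucket-prefix list is
obtained, and where the cursor is persisted are all assumptions for
illustration, not existing OM code:

```java
import java.util.Arrays;
import java.util.List;
import org.rocksdb.ColumnFamilyHandle;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class BucketPagingCompactor {

  private byte[] nextBucket; // persisted paging cursor; null = start over

  /**
   * One round of the background task: compact at most maxBucketsPerRound
   * bucket ranges, resuming from the nextBucket cursor. bucketPrefixes
   * must be sorted in the key table's comparator order.
   */
  void runOneRound(RocksDB db, ColumnFamilyHandle keyTable,
      List<byte[]> bucketPrefixes, int maxBucketsPerRound)
      throws RocksDBException {
    int compacted = 0;
    for (byte[] prefix : bucketPrefixes) {
      // Skip buckets already handled in earlier rounds.
      if (nextBucket != null
          && Arrays.compareUnsigned(prefix, nextBucket) < 0) {
        continue;
      }
      if (compacted == maxBucketsPerRound) {
        nextBucket = prefix; // resume from this bucket next round
        return;
      }
      // One bucket at a time keeps each step's write amplification
      // bounded by the bucket size instead of the whole CF. A real
      // implementation would first gate this on the needsCompaction()
      // check sketched earlier.
      db.compactRange(keyTable, prefix, prefixSuccessor(prefix));
      compacted++;
    }
    nextBucket = null; // wrapped around; next round starts from the top
  }

  /** Smallest key that sorts after every key with the given prefix. */
  private static byte[] prefixSuccessor(byte[] prefix) {
    byte[] end = prefix.clone();
    for (int i = end.length - 1; i >= 0; i--) {
      if (end[i] != (byte) 0xFF) {
        end[i]++;
        return Arrays.copyOf(end, i + 1);
      }
    }
    return null; // all 0xFF: a null end means "to the end of the CF"
  }
}
```

`prefixSuccessor` yields the exclusive upper bound for both `compactRange`
and the `getPropertiesOfTablesInRange` check, so each bucket range can be
tested first and compacted only when it crosses the tombstone threshold.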
- [TableProperties
class](https://github.com/facebook/rocksdb/blob/main/java/src/main/java/org/rocksdb/TableProperties.java#L12)
- [`public Map<String, TableProperties> getPropertiesOfTablesInRange(final
ColumnFamilyHandle columnFamilyHandle, final List<Range>
ranges)`](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/RocksDB.java#L4575)
- [Range
class](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/Range.java)
These Java APIs already provide enough support to implement the ideas
above.