Peter Lee created HDDS-12682:
--------------------------------

             Summary: Aggressive DB Compaction with Minimal Degradation
                 Key: HDDS-12682
                 URL: https://issues.apache.org/jira/browse/HDDS-12682
             Project: Apache Ozone
          Issue Type: Improvement
          Components: db, OM
            Reporter: Peter Lee
            Assignee: Peter Lee


After researching TiKV and RocksDB compaction, I have some thoughts on the 
current OM compaction:

1. TiKV also runs a background task to perform compaction.
2. If we directly compact an entire column family (CF), it would likely hurt 
online performance due to the resulting write amplification.
3. We can use the built-in `TableProperties` of an SST file to check its 
`num_entries` and `num_deletions`, but these two metrics only count operations 
and don’t deduplicate keys.
4. TiKV has implemented a custom `MVCTablePropertiesCollector`, which includes 
deduplication for more accurate results. However, the Java API currently 
doesn’t seem to support custom `TablePropertiesCollector` 💩, so we’re forced to 
make do with the built-in statistical data.
5. TiKV logically splits key ranges (table regions, with a default size limit 
of 256 MB per region), allowing it to gradually scan and compact known ranges.
   - [Compaction key range 
paging](https://github.com/tikv/tikv/pull/2631/files#diff-49d2597226cac1291163478f47bee5d4530bd4b9b84d322059e8afaf7dd3dedcR1896-R1938)
   - [Check if each key range needs 
compaction](https://github.com/tikv/tikv/pull/2631/files#diff-52d5655c2ce5a05afae67d216f55e98a1d71c971e1869628b7ebe387dda90a37R203-R217)
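
Since the Java API doesn’t let us plug in a custom properties collector, any 
compaction trigger would have to be a heuristic over the built-in, 
non-deduplicated statistics. A minimal sketch of such a check — the 0.3 
tombstone ratio, the minimum-size cutoff, and all names here are illustrative 
assumptions, not values taken from Ozone or TiKV:

```java
// Heuristic: decide whether a key range is worth manually compacting,
// using only the per-SST counters that RocksDB's built-in TableProperties
// exposes (getNumEntries() / getNumDeletions()). Because those counters
// track operations rather than deduplicated keys, this can only ever be
// an approximation. Threshold values below are illustrative.
final class CompactionHeuristic {

  // Compact once tombstones make up a large share of all entries.
  static final double DELETION_RATIO_THRESHOLD = 0.3;

  // Skip tiny ranges where a manual compaction would not pay off.
  static final long MIN_ENTRIES = 10_000;

  static boolean needsCompaction(long numEntries, long numDeletions) {
    if (numEntries < MIN_ENTRIES) {
      return false;
    }
    return (double) numDeletions / numEntries >= DELETION_RATIO_THRESHOLD;
  }
}
```

In practice `numEntries` and `numDeletions` would be summed over the 
`TableProperties` returned for the range being considered.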

To apply this logic to the Ozone Manager (OM): OM isn’t a distributed KV 
store, so it has no native concept of key ranges. The only logical key-range 
division available to us is the bucket prefix (file table). Better still, for 
FSO buckets we can subdivide key ranges further based on the directory 
`parent_id`.

So, I think for `KeyTable` compaction, we could compact each bucket 
individually and keep track of a `next_bucket` for paging. For 
directory-related tables, we could also compact each bucket, but if a bucket 
turns out to be too large, we could further compact based on the ordered 
`parent_id` key ranges. This would require two paging keys: `next_bucket` and 
`next_parent_id`.
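
The bucket-paging idea above could look roughly like the following. The 
`/volume/bucket/key` layout and every class and method name here are 
assumptions for illustration; the `[begin, end)` pair it produces is what 
would be handed to a range-scoped compaction call, with `next_bucket` 
advanced after each step:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of per-bucket key-range paging for KeyTable compaction. A
// bucket's keys are assumed to share the prefix "/volume/bucket/", so the
// bucket's range is [prefix, successor(prefix)). This mirrors how TiKV
// pages through region ranges.
final class BucketRangePager {

  // Shortest byte string that sorts strictly after every string with the
  // given prefix: increment the last byte that is not 0xff and truncate.
  static byte[] prefixSuccessor(byte[] prefix) {
    byte[] end = Arrays.copyOf(prefix, prefix.length);
    for (int i = end.length - 1; i >= 0; i--) {
      if ((end[i] & 0xff) != 0xff) {
        end[i]++;
        return Arrays.copyOf(end, i + 1);
      }
    }
    return new byte[0]; // all 0xff: no upper bound, range is right-open
  }

  // [begin, end) pair covering exactly one bucket's keys.
  static byte[][] bucketRange(String volume, String bucket) {
    byte[] begin = ("/" + volume + "/" + bucket + "/")
        .getBytes(StandardCharsets.UTF_8);
    return new byte[][] {begin, prefixSuccessor(begin)};
  }
}
```

For directory tables the same successor trick would apply one level deeper, 
over the ordered `parent_id` prefixes inside a bucket.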

- [TableProperties 
class](https://github.com/facebook/rocksdb/blob/main/java/src/main/java/org/rocksdb/TableProperties.java#L12)
- [`public Map<String, TableProperties> getPropertiesOfTablesInRange(final 
ColumnFamilyHandle columnFamilyHandle, final List<Range> 
ranges)`](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/RocksDB.java#L4575)
- [Range 
Class](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/Range.java)

The Java API already provides enough of these building blocks to implement 
the ideas above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
