Peter Lee created HDDS-12682:
--------------------------------

             Summary: Aggressive DB Compaction with Minimal Degradation
                 Key: HDDS-12682
                 URL: https://issues.apache.org/jira/browse/HDDS-12682
             Project: Apache Ozone
          Issue Type: Improvement
          Components: db, OM
            Reporter: Peter Lee
            Assignee: Peter Lee


After researching TiKV and RocksDB compaction, I have some thoughts on the 
current OM compaction:

1. TiKV also runs a background task to perform compaction.
2. If we directly compact an entire column family (CF), it would likely hurt 
online performance due to the resulting write amplification.
3. We can use the built-in `TableProperties` of an SST file to check its 
`num_entries` and `num_deletions`, but these two metrics only count operations 
and don’t deduplicate keys.
4. TiKV has implemented a custom `MVCTablePropertiesCollector`, which includes 
deduplication for more accurate results. However, the Java API currently 
doesn’t seem to support custom `TablePropertiesCollector` 💩, so we’re forced to 
make do with the built-in statistical data.
5. TiKV logically splits key ranges (table regions, with a default size limit 
of 256 MB per region), allowing it to gradually scan and compact known ranges.
   - [Compaction key range 
paging](https://github.com/tikv/tikv/pull/2631/files#diff-49d2597226cac1291163478f47bee5d4530bd4b9b84d322059e8afaf7dd3dedcR1896-R1938)
   - [Check if each key range needs 
compaction](https://github.com/tikv/tikv/pull/2631/files#diff-52d5655c2ce5a05afae67d216f55e98a1d71c971e1869628b7ebe387dda90a37R203-R217)
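
Since the Java API doesn’t let us plug in a custom properties collector, any 
compaction trigger would have to be a heuristic over the built-in, 
non-deduplicated statistics. A minimal sketch of such a check — the 0.3 
tombstone ratio, the minimum-size cutoff, and all names here are illustrative 
assumptions, not values taken from Ozone or TiKV:

```java
// Heuristic: decide whether a key range is worth manually compacting,
// using only the per-SST counters that RocksDB's built-in TableProperties
// exposes (getNumEntries() / getNumDeletions()). Because those counters
// track operations rather than deduplicated keys, this can only ever be
// an approximation. Threshold values below are illustrative.
final class CompactionHeuristic {

  // Compact once tombstones make up a large share of all entries.
  static final double DELETION_RATIO_THRESHOLD = 0.3;

  // Skip tiny ranges where a manual compaction would not pay off.
  static final long MIN_ENTRIES = 10_000;

  static boolean needsCompaction(long numEntries, long numDeletions) {
    if (numEntries < MIN_ENTRIES) {
      return false;
    }
    return (double) numDeletions / numEntries >= DELETION_RATIO_THRESHOLD;
  }
}
```

In practice `numEntries` and `numDeletions` would be summed over the 
`TableProperties` returned for the range being considered.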

To apply this logic to the Ozone Manager (OM): OM isn’t a distributed KV 
store, so it has no native concept of key ranges. The only logical key-range 
division available to us is the bucket prefix (file table). Better still, for 
FSO buckets we can subdivide key ranges further based on the directory 
`parent_id`.

So, I think for `KeyTable` compaction, we could compact each bucket 
individually and keep track of a `next_bucket` for paging. For 
directory-related tables, we could also compact each bucket, but if a bucket 
turns out to be too large, we could further compact based on the ordered 
`parent_id` key ranges. This would require two paging keys: `next_bucket` and 
`next_parent_id`.
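
The bucket-paging idea above could look roughly like the following. The 
`/volume/bucket/key` layout and every class and method name here are 
assumptions for illustration; the `[begin, end)` pair it produces is what 
would be handed to a range-scoped compaction call, with `next_bucket` 
advanced after each step:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch of per-bucket key-range paging for KeyTable compaction. A
// bucket's keys are assumed to share the prefix "/volume/bucket/", so the
// bucket's range is [prefix, successor(prefix)). This mirrors how TiKV
// pages through region ranges.
final class BucketRangePager {

  // Shortest byte string that sorts strictly after every string with the
  // given prefix: increment the last byte that is not 0xff and truncate.
  static byte[] prefixSuccessor(byte[] prefix) {
    byte[] end = Arrays.copyOf(prefix, prefix.length);
    for (int i = end.length - 1; i >= 0; i--) {
      if ((end[i] & 0xff) != 0xff) {
        end[i]++;
        return Arrays.copyOf(end, i + 1);
      }
    }
    return new byte[0]; // all 0xff: no upper bound, range is right-open
  }

  // [begin, end) pair covering exactly one bucket's keys.
  static byte[][] bucketRange(String volume, String bucket) {
    byte[] begin = ("/" + volume + "/" + bucket + "/")
        .getBytes(StandardCharsets.UTF_8);
    return new byte[][] {begin, prefixSuccessor(begin)};
  }
}
```

For directory tables the same successor trick would apply one level deeper, 
over the ordered `parent_id` prefixes inside a bucket.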

- [TableProperties 
class](https://github.com/facebook/rocksdb/blob/main/java/src/main/java/org/rocksdb/TableProperties.java#L12)
- [`public Map<String, TableProperties> getPropertiesOfTablesInRange(final 
ColumnFamilyHandle columnFamilyHandle, final List<Range> 
ranges)`](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/RocksDB.java#L4575)
- [Range 
Class](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/Range.java)

The Java API already provides enough of these building blocks to implement 
the ideas above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
