Hi,
I had a few questions related to compaction support, in particular
compaction for CDC destination Iceberg tables. Perhaps this information is
available somewhere else, but I could not find it readily, so any responses
would be appreciated.

   1. I believe compaction for the CDC use case will require Iceberg
   version >= 0.13 (to pick up the change that keeps the same sequence
   numbers after compaction) and Spark version >= 3.0 (for the actual
   compaction action support), but please correct me if I'm wrong.
   2. How can the compaction action (via Spark) actually be triggered? Is
   it possible to specify a filter predicate, as well as the file-size and
   delete-file-count thresholds for the compaction strategy, via SQL
   statements, or does one have to use the XXXRewriteDataFilesSparkAction
   classes directly from within a Spark jar (roughly the kind of
   invocation sketched after this list)?
   3. As far as I could understand from reading the code, the rewrite
   action processes all the data that matches a filter predicate (most
   likely a partition in practice). Internally, the matched data is broken
   into smaller chunks, which are processed concurrently. Are there any
   thoughts on setting a limit on the total amount of work done by a
   single rewrite operation? I am worried about really large partitions
   where, even though the operation is broken into chunks, it will still
   take a long time to finish.
   4. Regular compaction will remove the need for the equality and
   position delete files, but those files will still be around. Is a
   separate compaction action being planned to actually remove the
   equality and position delete files?
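
For question 2, the kind of direct invocation I have in mind is roughly the
sketch below (based on my reading of the Spark actions API; the table name,
partition column, and option values are just placeholders, so please correct
me if the entry point or option names are off):

   // Rough sketch: calling the rewrite action classes directly from a
   // Spark job ("spark" is an existing SparkSession).
   import org.apache.iceberg.Table;
   import org.apache.iceberg.actions.RewriteDataFiles;
   import org.apache.iceberg.expressions.Expressions;
   import org.apache.iceberg.spark.Spark3Util;
   import org.apache.iceberg.spark.actions.SparkActions;

   Table table = Spark3Util.loadIcebergTable(spark, "db.cdc_target");

   RewriteDataFiles.Result result =
       SparkActions.get(spark)
           .rewriteDataFiles(table)
           // restrict the rewrite to a single partition's worth of data
           .filter(Expressions.equal("event_date", "2022-01-01"))
           // bin-pack strategy with size / file-count thresholds
           .binPack()
           .option("target-file-size-bytes", Long.toString(512L * 1024 * 1024))
           .option("min-input-files", "5")
           .execute();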

Thanks,
- Puneet
