> On Dec 6, 2021, at 9:02 PM, Puneet Zaroo <pza...@netflix.com.INVALID> wrote:
>
> Hi,
> I had a few questions related to compaction support, in particular compaction
> for CDC destination iceberg tables. Perhaps this information is available
> somewhere else, but I could not find it readily, so responses appreciated.
>
> I believe compaction for the CDC use case will require iceberg version >=
> 0.13 (to pick up the change that maintains the same sequence numbers after
> compaction) and Spark version >= 3.0 (for the actual compaction action
> support). But please correct me if I'm wrong.

This isn't strictly necessary, but in practice it may be, depending on your CDC ingestion pace. Spark 2.4 contains an older implementation of the compaction code which doesn't have the same feature set but can still be used to compact data files.

> How can the compaction action (via Spark) actually be triggered? Is it
> possible to specify a filter predicate as well as the size and number of
> delete file thresholds for the compaction strategy via SQL statements, or does
> one have to use the XXXRewriteDataFilesSparkAction classes directly from
> within a Spark jar?

There are two methods via Spark:

1. The Action API
https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
(see the examples in this test file with all parameters being set as well)

2. The SQL API (just included in master)
https://github.com/apache/iceberg/blob/master/spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java

> As far as I could understand from reading the code, the rewrite action
> processes all the data that matches a filter predicate (most likely a
> partition in practice). Internally the whole matched data is broken into
> smaller chunks which are processed concurrently. Any thoughts on setting a
> limit on the amount of work being done by the whole operation? I am worried
> about really large partitions where, even though the whole operation is broken
> into chunks, it will take a long time to finish.

Filters are usually the best way to limit the total size of the operation. Additionally, we have the concept of "partial_progress", which allows the rewrite to commit as it goes rather than all at once at the end. This means you can terminate a job and still make progress.
https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L33-L48
> The regular compaction will remove the need for equality and position delete
> files, but those files will still be around. Is there a separate compaction
> action being planned to actually remove the equality and position delete
> files?

This is in progress; please check the dev list archives and Slack for more information.

> Thanks,
> - Puneet

Most of the delete work is still in progress, and we are always looking for reviewers and developers to help out, so make sure to keep an eye on GitHub.

Russ