For clarification, Ajantha is correct about 4; I just mean we could remove delete files more eagerly using an additional procedure, but normal snapshot expiration still works. -Jack
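For reference, snapshot expiration can also be invoked as a Spark procedure. A minimal sketch (the catalog name, table name, and retention settings below are placeholders, not values from this thread):

```sql
-- Sketch only: expire old snapshots so data and delete files that are no
-- longer referenced by any retained snapshot become eligible for cleanup.
CALL my_catalog.system.expire_snapshots(
  table => 'db.cdc_table',
  older_than => TIMESTAMP '2021-11-29 00:00:00.000',  -- expire snapshots older than this
  retain_last => 10                                   -- but always keep the last 10
)
```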
On Mon, Dec 6, 2021 at 9:22 PM Jack Ye <yezhao...@gmail.com> wrote:

> 1. Yes, you are correct.
> 2. We just added the SQL procedure call, if you don't want to directly
> invoke the action via Spark:
> https://github.com/apache/iceberg/blob/master/site/docs/spark-procedures.md?plain=1#L243
> 3. The filter is a data filter; it does not need to be at a partition
> boundary, so you can refine the filter based on your compute resources.
> 4. The position and equality delete files are no longer a part of the read
> path, so they should not have much impact on performance, but we do have
> plans to improve our snapshot expiration procedure (likely by adding
> another procedure dedicated to removing deletes) to clean up those files.
>
> -Jack
>
> On Mon, Dec 6, 2021 at 7:06 PM Puneet Zaroo <pza...@netflix.com.invalid> wrote:
>
>> Hi,
>> I had a few questions related to compaction support, in particular
>> compaction for CDC destination Iceberg tables. Perhaps this information is
>> available somewhere else, but I could not find it readily, so responses are
>> appreciated.
>>
>> 1. I believe compaction for the CDC use case will require Iceberg
>> version >= 0.13 (to pick up the change that maintains the same sequence
>> numbers after compaction) and Spark version >= 3.0 (for the actual
>> compaction action support). Please correct me if I'm wrong.
>> 2. How can the compaction action (via Spark) actually be triggered?
>> Is it possible to specify the filter predicate as well as the size and
>> number-of-delete-files thresholds for the compaction strategy via SQL
>> statements, or does one have to use the XXXRewriteDataFilesSparkAction
>> classes directly from within a Spark jar?
>> 3. As far as I could understand from reading the code, the rewrite
>> action processes all the data that matches a filter predicate (most likely
>> a partition in practice). Internally, the whole matched data set is broken
>> into smaller chunks which are processed concurrently. Any thoughts on
>> setting a limit on the amount of work done by the whole operation? I am
>> worried about really large partitions where, even though the whole
>> operation is broken into chunks, it will take a long time to finish.
>> 4. Regular compaction will remove the need for the equality and
>> position delete files, but those files will still be around. Is there a
>> separate compaction action being planned to actually remove the equality
>> and position delete files?
>>
>> Thanks,
>> - Puneet
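For concreteness, a minimal sketch of the rewrite_data_files procedure described in the docs linked above; the catalog name, table name, and filter column are placeholders, and the options shown are the standard rewrite options for shaping concurrency and commit granularity:

```sql
-- Sketch only: compact files matching a data filter. The filter bounds the
-- total work; the options shape how that work is parallelized and committed.
CALL my_catalog.system.rewrite_data_files(
  table => 'db.cdc_table',
  strategy => 'binpack',
  where => 'event_date = "2021-12-06"',          -- data filter, not a partition boundary
  options => map(
    'max-concurrent-file-group-rewrites', '4',   -- file groups rewritten in parallel
    'partial-progress.enabled', 'true'           -- commit groups as they finish
  )
)
```

With partial progress enabled, a long rewrite of a very large partition commits file groups as they complete rather than in one final commit, so an interrupted run still makes durable progress; narrowing the `where` filter remains the main lever for bounding the total amount of work per invocation.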