Regarding 4, when you run a RewriteDataFiles, the MergingSnapshotProducer automatically drops a delete file if there is no data file of lower sequence number: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L523-L525. This serves as a soft delete of delete files, marking them as DELETED entries in the manifest.
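The condition can be illustrated with a small sketch (the class and method names below are hypothetical, not the actual MergingSnapshotProducer code): a delete file applies only to data files with a lower sequence number (or an equal one, for position deletes), so once the minimum sequence number among live data files exceeds the delete file's sequence number, the delete file can never apply again and is safe to drop.

```java
// Hypothetical simplification of the check around
// MergingSnapshotProducer.java#L523-L525; names are illustrative,
// not the actual Iceberg API.
public class DeleteFileExpiry {

    // A delete file applies only to data files whose sequence number is
    // lower than (or, for position deletes, equal to) its own. If every
    // remaining data file is newer, the delete file is dead weight.
    static boolean canDropDeleteFile(long deleteFileSeq, long minDataFileSeq) {
        return deleteFileSeq < minDataFileSeq;
    }

    public static void main(String[] args) {
        // min live data sequence number is 5: a delete file written at
        // sequence 3 cannot apply to any remaining data file -> droppable
        System.out.println(canDropDeleteFile(3, 5)); // true
        // a delete file at sequence 5 may still apply to data files also
        // at sequence 5 (position deletes), so it must be kept
        System.out.println(canDropDeleteFile(5, 5)); // false
    }
}
```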
When you run ExpireSnapshots, I think there is no place where we explicitly check whether a manifest entry is a DataFile rather than a DeleteFile. We retrieve all manifests and use ManifestEntry<?> throughout, which means delete files are hard-deleted in the same way as data files. I think we could expire delete files more eagerly, before expiring a snapshot; that would definitely be a great area for anyone who would like to take a look. I have provided more details in the last section of https://github.com/apache/iceberg/pull/3432/files.

-Jack

On Tue, Dec 7, 2021 at 11:36 AM Puneet Zaroo <pza...@netflix.com.invalid> wrote:

> Ajantha, Jack and Russell,
> Thanks for the prompt replies. Just consolidating the information, my
> understanding is:
>
> 1. Spark 2.4 should also have support via the direct action API for
> compaction (and the action API should be sufficient for me); but the class
> pointed out
> <https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>
> seems to be an abstract class and I could not find an actual implementation
> in Spark 2.4. Please correct me if I missed something.
> 2. The Action API should be sufficient for my purpose; thanks for pointing
> out the unit tests showing how it works, but please verify whether this is
> available in Spark 2.4.
> 3. Currently at Netflix we have a custom solution that compacts very
> large partitions incrementally via small batches, with the batch size
> configured outside of the Spark job doing the actual merge. This gives us
> more control over the resource consumption of the Spark jobs. Over time we
> would like to migrate to the Actions API instead, but having the batching
> be completely controlled internally by the Spark job may not work out.
> I can look at whether subpartition-level filtering would take care of this
> issue, but I doubt it will give us the granular control we need. Perhaps
> having an option like the maximum number of files (or bytes) to process
> would be better.
> 4. I am not sure if snapshot expiry will by itself automatically
> garbage-collect the unnecessary delete files. For that to happen, I think
> an explicit DELETE commit of the delete files needs to happen first, by an
> action that verifies that the delete files are no longer needed in the
> latest table snapshot. Perhaps there is some work happening to develop such
> an action? I would love to look at any pending PRs for that effort.
>
> Thanks,
> - Puneet
>
> On Mon, Dec 6, 2021 at 9:34 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> On Dec 6, 2021, at 9:02 PM, Puneet Zaroo <pza...@netflix.com.INVALID>
>> wrote:
>>
>> Hi,
>> I had a few questions related to compaction support, in particular
>> compaction for CDC destination Iceberg tables. Perhaps this information is
>> available somewhere else, but I could not find it readily, so responses
>> appreciated.
>>
>> 1. I believe compaction for the CDC use case will require Iceberg
>> version >= 0.13 (to pick up the change that maintains the same sequence
>> numbers after compaction) and Spark version >= 3.0 (for the actual
>> compaction action support). But please correct me if I'm wrong.
>>
>> This isn't strictly necessary, but practically it may be, depending on
>> your CDC ingestion pace. Spark 2.4 contains an older implementation of the
>> compaction code which doesn't have the same feature set but can be used to
>> compact data files.
>>
>> 2. How can the compaction action (via Spark) actually be triggered?
>> Is it possible to specify a filter predicate, as well as the size and
>> number-of-delete-files thresholds for the compaction strategy, via SQL
>> statements, or does one have to use the XXXRewriteDataFilesSparkAction
>> classes directly from within a Spark jar?
>>
>> There are two methods via Spark.
>> 1. The Action API:
>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
>> (see the examples in the test file, with all parameters being set)
>> 2. The SQL API (just included in master):
>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java
>>
>> 3. As far as I could understand from reading the code, the rewrite
>> action processes all the data that matches a filter predicate (most likely
>> a partition in practice). Internally, the whole matched data set is broken
>> into smaller chunks which are processed concurrently. Any thoughts on
>> setting a limit on the amount of work done by the whole operation? I am
>> worried about really large partitions where, even though the whole
>> operation is broken into chunks, it will take a long time to finish.
>>
>> Filters are usually the best way to limit the total size of the
>> operation. Additionally, we have the concept of "partial_progress", which
>> allows the rewrite to commit as it goes rather than all at once at the
>> end. This means you can terminate a job and still make progress.
>>
>> https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L33-L48
>>
>> 4. The regular compaction will remove the need for equality and
>> position delete files, but those files will still be around. Is there a
>> separate compaction action being planned to actually remove the equality
>> and position delete files?
>>
>> This is in progress; please check the dev list archives and Slack for
>> more information.
>>
>> Thanks,
>> - Puneet
>>
>> Most of the delete work is still in progress and we are always looking
>> for reviewers and developers to help out, so make sure to keep an eye on
>> GitHub.
>>
>> Russ
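For anyone following along: the Action API usage and the work-limiting options discussed in this thread can be sketched roughly as below. This is a sketch against the Spark 3 Actions API and assumes a running Spark session with an Iceberg catalog configured; the filter column "event_date" and the option values are illustrative, and the option keys follow the RewriteDataFiles javadoc linked above.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.spark.actions.SparkActions;

// Sketch only: assumes `table` was loaded from a catalog within an
// active SparkSession; column name and option values are illustrative.
public class CompactionSketch {
    static void compact(Table table) {
        SparkActions.get()
            .rewriteDataFiles(table)
            // limit the work to one partition's worth of data
            .filter(Expressions.equal("event_date", "2021-12-07"))
            // commit file groups as they finish instead of one big commit,
            // so a terminated job still makes progress
            .option("partial-progress.enabled", "true")
            .option("partial-progress.max-commits", "10")
            // cap how much data is rewritten in a single file group
            .option("max-file-group-size-bytes",
                    String.valueOf(10L * 1024 * 1024 * 1024))
            .execute();
    }
}
```

The SQL API equivalent is a `CALL catalog.system.rewrite_data_files(...)` procedure; the TestRewriteDataFilesProcedure file linked above shows concrete invocations.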