> 1. Spark 2.4 should also have support via the direct action API for
> compaction (and the action API should be sufficient for me); but the class
> pointed out
> <https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>
> seems to be an abstract class and I could not find an actual
> implementation in Spark 2.4. Please correct me if I missed something.
> 2. The Action API should be sufficient for my purpose; thanks for
> pointing out the unit tests showing how it works, but please verify
> whether this is available in Spark 2.4.

I have checked this. The deprecated Actions class did have an implementation for rewriting data files, but its replacement, SparkActions, does not have one in Spark 2.4.

The deprecated class was removed in https://github.com/apache/iceberg/pull/3587. I am not sure why it was deprecated without an alternate implementation in place, so I have raised a PR to fix this: https://github.com/apache/iceberg/pull/3685/
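In the meantime, the Spark 3 call shape (which a Spark 2.4 implementation would mirror) looks roughly like the sketch below. This is only a sketch against current master; the class name, the table handle, and the "event_date" partition column are placeholders, not anything from this thread:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.RewriteDataFiles;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.spark.actions.SparkActions;

    public class CompactOnePartition {
      // Bin-packs the files matching the filter toward ~512 MB outputs;
      // `table` is an already-loaded org.apache.iceberg.Table.
      public static RewriteDataFiles.Result run(Table table) {
        return SparkActions.get()  // uses the active SparkSession
            .rewriteDataFiles(table)
            .filter(Expressions.equal("event_date", "2021-12-01"))
            .option("target-file-size-bytes",
                String.valueOf(512L * 1024 * 1024))
            .execute();
      }
    }

(The 512 MB target just matches the write.target-file-size-bytes default.)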
Thanks,
Ajantha

On Wed, Dec 8, 2021 at 1:06 AM Puneet Zaroo <pza...@netflix.com.invalid>
wrote:

> Ajantha, Jack and Russell,
> Thanks for the prompt replies. Just consolidating the information, my
> understanding is:
>
> 1. Spark 2.4 should also have support via the direct action API for
> compaction (and the action API should be sufficient for me); but the class
> pointed out
> <https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>
> seems to be an abstract class and I could not find an actual
> implementation in Spark 2.4. Please correct me if I missed something.
> 2. The Action API should be sufficient for my purpose; thanks for
> pointing out the unit tests showing how it works, but please verify
> whether this is available in Spark 2.4.
> 3. Currently at Netflix we have a custom solution that compacts very
> large partitions incrementally via small batches, with the batch size
> configured outside of the Spark job doing the actual merge. This gives us
> more control over the resource consumption of the Spark jobs. Over time we
> would like to migrate to the Actions API, but having the batching
> controlled entirely inside the Spark job may not work for us. I can look
> at whether subpartition-level filtering would take care of this issue, but
> I doubt it will give us the granular control we need. Perhaps an option
> such as the maximum number of files (or bytes) to process would be better.
> 4. I am not sure snapshot expiry will by itself garbage-collect the
> unnecessary delete files. For that to happen, I think an explicit DELETE
> commit of the delete files needs to happen first, by an action that
> verifies the delete files are no longer needed in the latest table
> snapshot. Perhaps there is some work happening to develop such an action?
> I would love to look at any pending PRs for that effort.
>
> Thanks,
> - Puneet
>
> On Mon, Dec 6, 2021 at 9:34 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> On Dec 6, 2021, at 9:02 PM, Puneet Zaroo <pza...@netflix.com.INVALID>
>> wrote:
>>
>> Hi,
>> I had a few questions related to compaction support, in particular
>> compaction for CDC destination Iceberg tables. Perhaps this information
>> is available somewhere else, but I could not find it readily, so
>> responses are appreciated.
>>
>> 1. I believe compaction for the CDC use case will require Iceberg
>> version >= 0.13 (to pick up the change that maintains the same sequence
>> numbers after compaction) and Spark version >= 3.0 (for the actual
>> compaction action support). But please correct me if I'm wrong.
>>
>> This isn't strictly necessary, but practically it may be, depending on
>> your CDC ingestion pace. Spark 2.4 contains an older implementation of
>> the compaction code which doesn't have the same feature set but can still
>> be used to compact data files.
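>> For example (a sketch against the 0.12-era API, so the exact builder
>> methods may differ by release; the class name, table handle, and
>> "event_date" column are placeholders):
>>
>>   import org.apache.iceberg.Table;
>>   import org.apache.iceberg.actions.Actions;
>>   import org.apache.iceberg.expressions.Expressions;
>>
>>   public class CompactWithSpark24 {
>>     // `table` is an already-loaded Iceberg table; this bin-packs one
>>     // partition's data files toward ~512 MB outputs.
>>     public static void run(Table table) {
>>       Actions.forTable(table)
>>           .rewriteDataFiles()
>>           .filter(Expressions.equal("event_date", "2021-12-01"))
>>           .targetSizeInBytes(512L * 1024 * 1024)
>>           .execute();
>>     }
>>   }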
>> 2. How can the compaction action (via Spark) actually be triggered?
>> Is it possible to specify a filter predicate, as well as the size and
>> delete-file count thresholds for the compaction strategy, via SQL
>> statements, or does one have to use the XXXRewriteDataFilesSparkAction
>> classes directly from within a Spark jar?
>>
>> There are two methods via Spark:
>> 1. The Action API (see the examples in this test file, with all the
>> parameters being set):
>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
>> 2. The SQL API (just included in master):
>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java
>>
>> 3. As far as I could understand from reading the code, the rewrite
>> action processes all the data that matches a filter predicate (most
>> likely a partition in practice). Internally, the whole matched data set
>> is broken into smaller chunks which are processed concurrently. Any
>> thoughts on setting a limit on the amount of work done by the whole
>> operation? I am worried about really large partitions where, even though
>> the operation is broken into chunks, it will take a long time to finish.
>>
>> Filters are usually the best way to limit the total size of the
>> operation. Additionally, we have the concept of "partial_progress", which
>> allows the rewrite to commit as it goes rather than all at once at the
>> end. This means you can terminate a job and still make progress. (A
>> sketch of the procedure call with the partial-progress options is at the
>> end of this mail.)
>>
>> https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L33-L48
>>
>> 4. The regular compaction will remove the need for the equality and
>> position delete files, but those files will still be around. Is there a
>> separate compaction action being planned to actually remove the equality
>> and position delete files?
>>
>> This is in progress; please check the dev list archives and Slack for
>> more information.
>>
>> Thanks,
>> - Puneet
>>
>> Most of the delete work is still in progress and we are always looking
>> for reviewers and developers to help out, so make sure to keep an eye on
>> GitHub.
>>
>> Russ
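>> As promised above, a sketch of the procedure call (the catalog name
>> "my_catalog", the table "db.cdc_table", and the "event_date" column are
>> placeholders; the option names are the constants from the
>> RewriteDataFiles.java link above):
>>
>>   import org.apache.spark.sql.SparkSession;
>>
>>   public class RewriteWithPartialProgress {
>>     // Runs the rewrite_data_files procedure with partial progress
>>     // enabled, so completed file groups are committed as the job
>>     // goes along instead of in a single commit at the end.
>>     public static void run() {
>>       SparkSession.active().sql(
>>           "CALL my_catalog.system.rewrite_data_files("
>>               + " table => 'db.cdc_table',"
>>               + " options => map("
>>               + "   'partial-progress.enabled', 'true',"
>>               + "   'partial-progress.max-commits', '10'),"
>>               + " where => 'event_date = \"2021-12-01\"')");
>>     }
>>   }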