> I just raised a PR to fix it [https://github.com/apache/iceberg/pull/3685/]
It seems it is not straightforward. I will have discussions with Russell and
others in the PR and conclude.

Thanks,
Ajantha

On Wed, Dec 8, 2021 at 11:43 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

>> 1. Spark 2.4 should also have support via the direct action API for
>> compaction (and the action API should be sufficient for me); but the
>> class pointed out
>> <https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>
>> seems to be an abstract class and I could not find an actual
>> implementation in Spark 2.4. Please correct me if I missed something.
>>
>> 2. Action API should be sufficient for my purpose, thanks for pointing
>> out the unit tests showing how it works, but please verify if this is
>> available in Spark 2.4.
>
> I have checked this. It seems the deprecated Actions class had an
> implementation for rewriting data files, but its replacement, SparkActions,
> does not have that implementation in Spark 2.4. A rough sketch of the
> removed API follows below.
>
> The deprecated class was removed as per this PR
> [https://github.com/apache/iceberg/pull/3587]. I am not sure why the class
> was deprecated without an alternate implementation. I just raised a PR to
> fix it [https://github.com/apache/iceberg/pull/3685/]
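>
> For reference, the removed Spark 2.4 path looked roughly like this. This
> is only a sketch against the old deprecated Actions API (Iceberg <= 0.12);
> the table location and the partition column "date" are just examples:
>
>     import org.apache.iceberg.Table;
>     import org.apache.iceberg.actions.Actions;
>     import org.apache.iceberg.expressions.Expressions;
>     import org.apache.iceberg.hadoop.HadoopTables;
>     import org.apache.spark.sql.SparkSession;
>
>     SparkSession spark = SparkSession.builder().getOrCreate();
>
>     // Load the table; a Hadoop path-based table is assumed here, but any
>     // catalog that returns an Iceberg Table works the same way.
>     Table table = new HadoopTables(spark.sparkContext().hadoopConfiguration())
>         .load("hdfs://warehouse/db/tbl");
>
>     // The deprecated entry point that still carried the rewrite action.
>     Actions.forTable(table)
>         .rewriteDataFiles()
>         // Restrict the rewrite to a single partition's data.
>         .filter(Expressions.equal("date", "2021-12-01"))
>         // Aim for 512 MB output files.
>         .targetSizeInBytes(512L * 1024 * 1024)
>         .execute();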
>
> Thanks,
> Ajantha
>
> On Wed, Dec 8, 2021 at 1:06 AM Puneet Zaroo <pza...@netflix.com.invalid> wrote:
>
>> Ajantha, Jack and Russell,
>> Thanks for the prompt replies. Just consolidating the information, my
>> understanding is:
>>
>> 1. Spark 2.4 should also have support via the direct action API for
>> compaction (and the action API should be sufficient for me); but the
>> class pointed out
>> <https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>
>> seems to be an abstract class and I could not find an actual
>> implementation in Spark 2.4. Please correct me if I missed something.
>>
>> 2. Action API should be sufficient for my purpose, thanks for pointing
>> out the unit tests showing how it works, but please verify if this is
>> available in Spark 2.4.
>>
>> 3. Currently at Netflix we have a custom solution that compacts very
>> large partitions incrementally via small batches, with the batch size
>> being configured outside of the Spark job doing the actual merge. This
>> gives us more control over the resource consumption of the Spark jobs.
>> Over time we would like to migrate to the Actions API instead, but having
>> the batching be completely controlled internally by the Spark job may not
>> work out. I can look at whether subpartition-level filtering would take
>> care of this issue, but I doubt it will give us the granular control we
>> need. Perhaps an option like the maximum number of files (or bytes) to
>> process would be better.
>>
>> 4. I am not sure if snapshot expiry will by itself automatically garbage
>> collect the unnecessary delete files. For that to happen, I think an
>> explicit DELETE commit of the delete files needs to happen first, by an
>> action that verifies that the delete files are no longer needed in the
>> latest table snapshot. Perhaps there is some work happening to develop
>> such an action? I would love to look at any pending PRs for that effort.
>>
>> Thanks,
>> - Puneet
>>
>> On Mon, Dec 6, 2021 at 9:34 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> On Dec 6, 2021, at 9:02 PM, Puneet Zaroo <pza...@netflix.com.INVALID> wrote:
>>>
>>> Hi,
>>> I had a few questions related to compaction support, in particular
>>> compaction for CDC destination Iceberg tables. Perhaps this information
>>> is available somewhere else, but I could not find it readily, so
>>> responses are appreciated.
>>>
>>> 1. I believe compaction for the CDC use case will require Iceberg
>>> version >= 0.13 (to pick up the change that maintains the same sequence
>>> numbers after compaction) and Spark version >= 3.0 (for the actual
>>> compaction action support). But please correct me if I'm wrong.
>>>
>>> This isn't strictly necessary, but practically it may be, depending on
>>> your CDC ingestion pace. Spark 2.4 contains an older implementation of
>>> the compaction code which doesn't have the same feature set but can be
>>> used to compact data files.
>>>
>>> 2. How can the compaction action (via Spark) actually be triggered? Is
>>> it possible to specify a filter predicate as well as the size and
>>> delete-file-count thresholds for the compaction strategy via SQL
>>> statements, or does one have to use the XXXRewriteDataFilesSparkAction
>>> classes directly from within a Spark jar?
>>>
>>> There are two methods via Spark (a rough sketch of both is at the
>>> bottom of this mail):
>>>
>>> 1. The Action API — see the examples in the test file here, with all
>>> parameters being set:
>>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
>>>
>>> 2. The SQL API (just included in master):
>>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java
>>>
>>> 3. As far as I could understand from reading the code, the rewrite
>>> action processes all the data that matches a filter predicate (most
>>> likely a partition in practice). Internally the whole matched data set
>>> is broken into smaller chunks which are processed concurrently. Any
>>> thoughts on setting a limit on the amount of work done by the whole
>>> operation? I am worried about really large partitions where, even though
>>> the whole operation is broken into chunks, it will take a long time to
>>> finish.
>>>
>>> Filters are usually the best way to limit the total size of the
>>> operation. Additionally, we have the concept of "partial progress",
>>> which allows the rewrite to commit as it goes rather than all at once at
>>> the end. This means you can terminate a job and still make progress.
>>>
>>> https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L33-L48
>>>
>>> 4. The regular compaction will remove the need for equality and
>>> position delete files, but those files will still be around. Is there a
>>> separate compaction action being planned to actually remove the equality
>>> and position delete files?
>>>
>>> This is in progress; please check the dev list archives and Slack for
>>> more information.
>>>
>>> Thanks,
>>> - Puneet
>>>
>>> Most of the delete work is still in progress and we are always looking
>>> for reviewers and developers to help out, so make sure to keep an eye on
>>> GitHub.
>>>
>>> Russ
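>>>
>>> For reference, a typical Spark 3 invocation with the knobs mentioned
>>> above looks roughly like this. This is a sketch only, assuming a recent
>>> master / 0.13 snapshot build; the catalog, table, and partition names
>>> are placeholders:
>>>
>>>     import org.apache.iceberg.Table;
>>>     import org.apache.iceberg.actions.RewriteDataFiles;
>>>     import org.apache.iceberg.expressions.Expressions;
>>>     import org.apache.iceberg.hadoop.HadoopTables;
>>>     import org.apache.iceberg.spark.actions.SparkActions;
>>>     import org.apache.spark.sql.SparkSession;
>>>
>>>     SparkSession spark = SparkSession.builder().getOrCreate();
>>>     Table table = new HadoopTables(spark.sparkContext().hadoopConfiguration())
>>>         .load("hdfs://warehouse/db/tbl");
>>>
>>>     // 1. The Action API: bin-pack one partition, with partial progress
>>>     // enabled so a terminated job keeps the file groups it already
>>>     // committed instead of committing everything at the end.
>>>     RewriteDataFiles.Result result =
>>>         SparkActions.get(spark)
>>>             .rewriteDataFiles(table)
>>>             .binPack()
>>>             .filter(Expressions.equal("date", "2021-12-01"))
>>>             .option(RewriteDataFiles.PARTIAL_PROGRESS_ENABLED, "true")
>>>             .option(RewriteDataFiles.MAX_CONCURRENT_FILE_GROUP_REWRITES, "5")
>>>             .execute();
>>>
>>>     // 2. The SQL API: the equivalent stored procedure call.
>>>     spark.sql(
>>>         "CALL my_catalog.system.rewrite_data_files("
>>>             + "table => 'db.tbl', "
>>>             + "where => 'date = \"2021-12-01\"')");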