Ajantha,
Thanks for reverting the change that removed the compaction action from
Spark 2.4. I just wanted to understand the difference between the
compaction action in Spark 2.4 and Spark >= 3.0. What is functionally
different between the two implementations? In other words, what would we
miss out on if we went with 2.4 instead of 3.0 or higher?
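For concreteness, the Spark 3 behavior I am comparing against looks
roughly like this (a sketch based on my reading of RewriteDataFiles.java;
the catalog/table handle and option values are made up, and I may have
details wrong):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.RewriteDataFiles;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.spark.actions.SparkActions;

    // load any Iceberg table; 'catalog' comes from our environment
    Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

    RewriteDataFiles.Result result =
        SparkActions.get()
            .rewriteDataFiles(table)
            // commit rewritten file groups as they finish,
            // instead of a single commit at the end
            .option(RewriteDataFiles.PARTIAL_PROGRESS_ENABLED, "true")
            .option(RewriteDataFiles.PARTIAL_PROGRESS_MAX_COMMITS, "10")
            // how many file groups are rewritten concurrently
            .option(RewriteDataFiles.MAX_CONCURRENT_FILE_GROUP_REWRITES, "5")
            .execute();

As far as I can tell, none of these knobs exist in the 2.4 code path.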
thanks,
- Puneet

On Tue, Dec 7, 2021 at 10:34 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> I just raised a PR to fix it
> [https://github.com/apache/iceberg/pull/3685/]
>
> It seems it is not straightforward. I will have discussions with Russell
> and others in the PR and we can conclude there.
>
> Thanks,
> Ajantha
>
> On Wed, Dec 8, 2021 at 11:43 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
>>> 1. Spark 2.4 should also have support via the direct action API for
>>> compaction (and the action API should be sufficient for me); but the
>>> class pointed out
>>> <https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>
>>> seems to be an abstract class and I could not find an actual
>>> implementation in Spark 2.4. Please correct me if I missed something.
>>>
>>> 2. The Action API should be sufficient for my purpose; thanks for
>>> pointing out the unit tests showing how it works, but please verify
>>> whether this is available in Spark 2.4.
>>
>> I have checked this. It seems the deprecated Actions class had an
>> implementation for rewriting data files, but its new version,
>> SparkActions, does not have that implementation.
>>
>> The deprecated class was removed in this PR
>> [https://github.com/apache/iceberg/pull/3587].
>> I am not sure why the class was deprecated without an alternate
>> implementation in place.
>> I just raised a PR to fix it
>> [https://github.com/apache/iceberg/pull/3685/]
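>>
>> For reference, the kind of call the deprecated class supported in Spark
>> 2.4 looked roughly like this (a sketch from memory; the table and the
>> filter are made up, and the exact signatures may differ):
>>
>>     import org.apache.iceberg.Table;
>>     import org.apache.iceberg.actions.Actions;
>>     import org.apache.iceberg.expressions.Expressions;
>>
>>     // 'table' is a loaded org.apache.iceberg.Table
>>     Actions.forTable(table)
>>         .rewriteDataFiles()
>>         .filter(Expressions.equal("date", "2021-12-01")) // hypothetical
>>         .targetSizeInBytes(512L * 1024 * 1024)           // 512 MB target
>>         .execute();
>>
>> This is the entry point that was removed without a SparkActions
>> replacement, which the PR above tries to fix.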
>>
>> Thanks,
>> Ajantha
>>
>> On Wed, Dec 8, 2021 at 1:06 AM Puneet Zaroo <pza...@netflix.com.invalid>
>> wrote:
>>
>>> Ajantha, Jack and Russell,
>>> Thanks for the prompt replies. Just consolidating the information, my
>>> understanding is:
>>>
>>> 1. Spark 2.4 should also have support via the direct action API for
>>> compaction (and the action API should be sufficient for me); but the
>>> class pointed out
>>> <https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>
>>> seems to be an abstract class and I could not find an actual
>>> implementation in Spark 2.4. Please correct me if I missed something.
>>> 2. The Action API should be sufficient for my purpose; thanks for
>>> pointing out the unit tests showing how it works, but please verify
>>> whether this is available in Spark 2.4.
>>> 3. Currently at Netflix we have a custom solution that compacts very
>>> large partitions incrementally via small batches, with the batch size
>>> being configured outside of the Spark job doing the actual merge. This
>>> gives us more control over the resource consumption of the Spark jobs.
>>> Over time we would like to migrate to the Actions API, but having the
>>> batching be completely controlled internally by the Spark job may not
>>> work out. I can look at whether subpartition-level filtering would
>>> take care of this issue, but I doubt it will give us the granular
>>> control we need. Perhaps having an option like the maximum number of
>>> files (or bytes) to process would be better.
>>> 4. I am not sure if snapshot expiry will by itself automatically
>>> garbage collect the unnecessary delete files. For that to happen, I
>>> think an explicit DELETE commit of the delete files needs to happen
>>> first, by an action that verifies that the delete files are no longer
>>> needed in the latest table snapshot. Perhaps there is some work
>>> happening to develop such an action? I would love to look at any
>>> pending PRs for that effort.
>>>
>>> Thanks,
>>> - Puneet
>>>
>>> On Mon, Dec 6, 2021 at 9:34 PM Russell Spitzer
>>> <russell.spit...@gmail.com> wrote:
>>>
>>>> On Dec 6, 2021, at 9:02 PM, Puneet Zaroo <pza...@netflix.com.INVALID>
>>>> wrote:
>>>>
>>>> Hi,
>>>> I had a few questions related to compaction support, in particular
>>>> compaction for CDC destination Iceberg tables. Perhaps this
>>>> information is available somewhere else, but I could not find it
>>>> readily, so responses are appreciated.
>>>>
>>>> 1. I believe compaction for the CDC use case will require an Iceberg
>>>> version >= 0.13 (to pick up the change that maintains the same
>>>> sequence numbers after compaction) and a Spark version >= 3.0 (for
>>>> the actual compaction action support). But please correct me if I'm
>>>> wrong.
>>>>
>>>> This isn't strictly necessary, but practically it may be, depending
>>>> on your CDC ingestion pace. Spark 2.4 contains an older
>>>> implementation of the compaction code which doesn't have the same
>>>> feature set but can be used to compact data files.
>>>>
>>>> 2. How can the compaction action (via Spark) actually be triggered?
>>>> Is it possible to specify a filter predicate as well as the size and
>>>> number-of-delete-files thresholds for the compaction strategy via SQL
>>>> statements, or does one have to use the XXXRewriteDataFilesSparkAction
>>>> classes directly from within a Spark jar?
>>>>
>>>> There are two methods via Spark.
>>>> 1. The Action API (see the examples in the test file, with all
>>>> parameters being set):
>>>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
>>>> 2. The SQL API (just included in master):
>>>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java
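>>>>
>>>> In miniature, the two routes look something like this (a quick,
>>>> untested sketch; the table name and option values are invented, so
>>>> check the tests above for the full parameter set):
>>>>
>>>>     import org.apache.iceberg.Table;
>>>>     import org.apache.iceberg.actions.RewriteDataFiles;
>>>>     import org.apache.iceberg.expressions.Expressions;
>>>>     import org.apache.iceberg.spark.actions.SparkActions;
>>>>
>>>>     // 1. The Action API; 'table' is a loaded org.apache.iceberg.Table
>>>>     RewriteDataFiles.Result result =
>>>>         SparkActions.get()
>>>>             .rewriteDataFiles(table)
>>>>             .filter(Expressions.equal("date", "2021-12-01"))
>>>>             .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
>>>>                 Long.toString(512L * 1024 * 1024))
>>>>             .execute();
>>>>
>>>>     // 2. The SQL procedure; 'spark' is the active SparkSession
>>>>     spark.sql(
>>>>         "CALL my_catalog.system.rewrite_data_files("
>>>>             + "table => 'db.events', "
>>>>             + "options => map('min-input-files', '2'))");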
>>>>
>>>> 3. As far as I could understand from reading the code, the rewrite
>>>> action processes all the data that matches a filter predicate (most
>>>> likely a partition in practice). Internally, the whole matched data
>>>> set is broken into smaller chunks which are processed concurrently.
>>>> Any thoughts on setting a limit on the amount of work done by the
>>>> whole operation? I am worried about really large partitions where,
>>>> even though the whole operation is broken into chunks, it will take
>>>> a long time to finish.
>>>>
>>>> Filters are usually the best way to limit the total size of the
>>>> operation. Additionally, we have the concept of "partial progress",
>>>> which allows the rewrite to commit as it goes rather than all at once
>>>> at the end. This means you can terminate a job and still make
>>>> progress.
>>>>
>>>> https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L33-L48
>>>>
>>>> 4. The regular compaction will remove the need for equality and
>>>> position delete files, but those files will still be around. Is there
>>>> a separate compaction action being planned to actually remove the
>>>> equality and position delete files?
>>>>
>>>> This is in progress; please check the dev list archives and Slack for
>>>> more information.
>>>>
>>>> Thanks,
>>>> - Puneet
>>>>
>>>> Most of the delete work is still in progress and we are always
>>>> looking for reviewers and developers to help out, so make sure to
>>>> keep an eye on GitHub.
>>>>
>>>> Russ