Ajantha,
Thanks for reverting the change that removed the compaction action from
Spark 2.4. I just wanted to understand the difference between the
compaction action in Spark 2.4 and Spark >= 3.0. What is functionally
different between the two implementations? In other words, what would we
miss out on if we went with 2.4 instead of 3.0 or higher?
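For concreteness, the Spark 3 behavior I am comparing against looks
roughly like this (a sketch based on my reading of RewriteDataFiles.java;
the catalog/table handle and option values are made up, and I may have
details wrong):

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.RewriteDataFiles;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.spark.actions.SparkActions;

    // load any Iceberg table; 'catalog' comes from our environment
    Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

    RewriteDataFiles.Result result =
        SparkActions.get()
            .rewriteDataFiles(table)
            // commit rewritten file groups as they finish,
            // instead of a single commit at the end
            .option(RewriteDataFiles.PARTIAL_PROGRESS_ENABLED, "true")
            .option(RewriteDataFiles.PARTIAL_PROGRESS_MAX_COMMITS, "10")
            // how many file groups are rewritten concurrently
            .option(RewriteDataFiles.MAX_CONCURRENT_FILE_GROUP_REWRITES, "5")
            .execute();

As far as I can tell, none of these knobs exist in the 2.4 code path.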
thanks,
- Puneet

On Tue, Dec 7, 2021 at 10:34 PM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> I just raised a PR to fix it
> [https://github.com/apache/iceberg/pull/3685/]
>
> It seems it is not straightforward. I will have discussions with Russell
> and others in the PR and we can conclude there.
>
> Thanks,
> Ajantha
>
> On Wed, Dec 8, 2021 at 11:43 AM Ajantha Bhat <ajanthab...@gmail.com>
> wrote:
>
>>> 1. Spark 2.4 should also have support via the direct action API for
>>> compaction (and the action API should be sufficient for me); but the
>>> class pointed out
>>> <https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>
>>> seems to be an abstract class and I could not find an actual
>>> implementation in Spark 2.4. Please correct me if I missed something.
>>>
>>> 2. The Action API should be sufficient for my purpose; thanks for
>>> pointing out the unit tests showing how it works, but please verify
>>> whether this is available in Spark 2.4.
>>
>> I have checked this. It seems the deprecated Actions class had an
>> implementation for rewriting data files, but its new version,
>> SparkActions, does not have that implementation.
>>
>> The deprecated class was removed in this PR
>> [https://github.com/apache/iceberg/pull/3587].
>> I am not sure why the class was deprecated without an alternate
>> implementation in place.
>> I just raised a PR to fix it
>> [https://github.com/apache/iceberg/pull/3685/]
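>>
>> For reference, the kind of call the deprecated class supported in Spark
>> 2.4 looked roughly like this (a sketch from memory; the table and the
>> filter are made up, and the exact signatures may differ):
>>
>>     import org.apache.iceberg.Table;
>>     import org.apache.iceberg.actions.Actions;
>>     import org.apache.iceberg.expressions.Expressions;
>>
>>     // 'table' is a loaded org.apache.iceberg.Table
>>     Actions.forTable(table)
>>         .rewriteDataFiles()
>>         .filter(Expressions.equal("date", "2021-12-01")) // hypothetical
>>         .targetSizeInBytes(512L * 1024 * 1024)           // 512 MB target
>>         .execute();
>>
>> This is the entry point that was removed without a SparkActions
>> replacement, which the PR above tries to fix.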
>>
>> Thanks,
>> Ajantha
>>
>> On Wed, Dec 8, 2021 at 1:06 AM Puneet Zaroo <pza...@netflix.com.invalid>
>> wrote:
>>
>>> Ajantha, Jack and Russell,
>>> Thanks for the prompt replies. Just consolidating the information, my
>>> understanding is:
>>>
>>> 1. Spark 2.4 should also have support via the direct action API for
>>> compaction (and the action API should be sufficient for me); but the
>>> class pointed out
>>> <https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>
>>> seems to be an abstract class and I could not find an actual
>>> implementation in Spark 2.4. Please correct me if I missed something.
>>> 2. The Action API should be sufficient for my purpose; thanks for
>>> pointing out the unit tests showing how it works, but please verify
>>> whether this is available in Spark 2.4.
>>> 3. Currently at Netflix we have a custom solution that compacts very
>>> large partitions incrementally via small batches, with the batch size
>>> being configured outside of the Spark job doing the actual merge. This
>>> gives us more control over the resource consumption of the Spark jobs.
>>> Over time we would like to migrate to the Actions API, but having the
>>> batching be completely controlled internally by the Spark job may not
>>> work out. I can look at whether subpartition-level filtering would
>>> take care of this issue, but I doubt it will give us the granular
>>> control we need. Perhaps having an option like the maximum number of
>>> files (or bytes) to process would be better.
>>> 4. I am not sure if snapshot expiry will by itself automatically
>>> garbage collect the unnecessary delete files. For that to happen, I
>>> think an explicit DELETE commit of the delete files needs to happen
>>> first, by an action that verifies that the delete files are no longer
>>> needed in the latest table snapshot. Perhaps there is some work
>>> happening to develop such an action? I would love to look at any
>>> pending PRs for that effort.
>>>
>>> Thanks,
>>> - Puneet
>>>
>>> On Mon, Dec 6, 2021 at 9:34 PM Russell Spitzer
>>> <russell.spit...@gmail.com> wrote:
>>>
>>>> On Dec 6, 2021, at 9:02 PM, Puneet Zaroo <pza...@netflix.com.INVALID>
>>>> wrote:
>>>>
>>>> Hi,
>>>> I had a few questions related to compaction support, in particular
>>>> compaction for CDC destination Iceberg tables. Perhaps this
>>>> information is available somewhere else, but I could not find it
>>>> readily, so responses are appreciated.
>>>>
>>>> 1. I believe compaction for the CDC use case will require an Iceberg
>>>> version >= 0.13 (to pick up the change that maintains the same
>>>> sequence numbers after compaction) and a Spark version >= 3.0 (for
>>>> the actual compaction action support). But please correct me if I'm
>>>> wrong.
>>>>
>>>> This isn't strictly necessary, but practically it may be, depending
>>>> on your CDC ingestion pace. Spark 2.4 contains an older
>>>> implementation of the compaction code which doesn't have the same
>>>> feature set but can be used to compact data files.
>>>>
>>>> 2. How can the compaction action (via Spark) actually be triggered?
>>>> Is it possible to specify a filter predicate as well as the size and
>>>> number-of-delete-files thresholds for the compaction strategy via SQL
>>>> statements, or does one have to use the XXXRewriteDataFilesSparkAction
>>>> classes directly from within a Spark jar?
>>>>
>>>> There are two methods via Spark.
>>>> 1. The Action API (see the examples in the test file, with all
>>>> parameters being set):
>>>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
>>>> 2. The SQL API (just included in master):
>>>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java
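>>>>
>>>> In miniature, the two routes look something like this (a quick,
>>>> untested sketch; the table name and option values are invented, so
>>>> check the tests above for the full parameter set):
>>>>
>>>>     import org.apache.iceberg.Table;
>>>>     import org.apache.iceberg.actions.RewriteDataFiles;
>>>>     import org.apache.iceberg.expressions.Expressions;
>>>>     import org.apache.iceberg.spark.actions.SparkActions;
>>>>
>>>>     // 1. The Action API; 'table' is a loaded org.apache.iceberg.Table
>>>>     RewriteDataFiles.Result result =
>>>>         SparkActions.get()
>>>>             .rewriteDataFiles(table)
>>>>             .filter(Expressions.equal("date", "2021-12-01"))
>>>>             .option(RewriteDataFiles.TARGET_FILE_SIZE_BYTES,
>>>>                 Long.toString(512L * 1024 * 1024))
>>>>             .execute();
>>>>
>>>>     // 2. The SQL procedure; 'spark' is the active SparkSession
>>>>     spark.sql(
>>>>         "CALL my_catalog.system.rewrite_data_files("
>>>>             + "table => 'db.events', "
>>>>             + "options => map('min-input-files', '2'))");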
>>>>
>>>> 3. As far as I could understand from reading the code, the rewrite
>>>> action processes all the data that matches a filter predicate (most
>>>> likely a partition in practice). Internally, the whole matched data
>>>> set is broken into smaller chunks which are processed concurrently.
>>>> Any thoughts on setting a limit on the amount of work done by the
>>>> whole operation? I am worried about really large partitions where,
>>>> even though the whole operation is broken into chunks, it will take
>>>> a long time to finish.
>>>>
>>>> Filters are usually the best way to limit the total size of the
>>>> operation. Additionally, we have the concept of "partial progress",
>>>> which allows the rewrite to commit as it goes rather than all at once
>>>> at the end. This means you can terminate a job and still make
>>>> progress.
>>>>
>>>> https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L33-L48
>>>>
>>>> 4. The regular compaction will remove the need for equality and
>>>> position delete files, but those files will still be around. Is there
>>>> a separate compaction action being planned to actually remove the
>>>> equality and position delete files?
>>>>
>>>> This is in progress; please check the dev list archives and Slack for
>>>> more information.
>>>>
>>>> Thanks,
>>>> - Puneet
>>>>
>>>> Most of the delete work is still in progress and we are always
>>>> looking for reviewers and developers to help out, so make sure to
>>>> keep an eye on GitHub.
>>>>
>>>> Russ