> I just raised a PR to fix it [https://github.com/apache/iceberg/pull/3685/]
It seems it is not straightforward. I will have discussions with Russell and
others in the PR and conclude.

Thanks,
Ajantha

On Wed, Dec 8, 2021 at 11:43 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

>> 1. Spark 2.4 should also have support via the direct action API for
>> compaction (and the action API should be sufficient for me); but the
>> class pointed out
>> <https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>
>> seems to be an abstract class and I could not find an actual
>> implementation in Spark 2.4. Please correct me if I missed something.
>>
>> 2. Action API should be sufficient for my purpose, thanks for pointing
>> out the unit tests showing how it works, but please verify if this is
>> available in Spark 2.4.
>
> I have checked this. It seems the deprecated Actions class had an
> implementation for rewriting data files, but its replacement, SparkActions,
> does not have that implementation in Spark 2.4. A rough sketch of the
> removed API follows below.
>
> The deprecated class was removed as per this PR
> [https://github.com/apache/iceberg/pull/3587]. I am not sure why the class
> was deprecated without an alternate implementation. I just raised a PR to
> fix it [https://github.com/apache/iceberg/pull/3685/]
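>
> For reference, the removed Spark 2.4 path looked roughly like this. This
> is only a sketch against the old deprecated Actions API (Iceberg <= 0.12);
> the table location and the partition column "date" are just examples:
>
>     import org.apache.iceberg.Table;
>     import org.apache.iceberg.actions.Actions;
>     import org.apache.iceberg.expressions.Expressions;
>     import org.apache.iceberg.hadoop.HadoopTables;
>     import org.apache.spark.sql.SparkSession;
>
>     SparkSession spark = SparkSession.builder().getOrCreate();
>
>     // Load the table; a Hadoop path-based table is assumed here, but any
>     // catalog that returns an Iceberg Table works the same way.
>     Table table = new HadoopTables(spark.sparkContext().hadoopConfiguration())
>         .load("hdfs://warehouse/db/tbl");
>
>     // The deprecated entry point that still carried the rewrite action.
>     Actions.forTable(table)
>         .rewriteDataFiles()
>         // Restrict the rewrite to a single partition's data.
>         .filter(Expressions.equal("date", "2021-12-01"))
>         // Aim for 512 MB output files.
>         .targetSizeInBytes(512L * 1024 * 1024)
>         .execute();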
>
> Thanks,
> Ajantha
>
> On Wed, Dec 8, 2021 at 1:06 AM Puneet Zaroo <pza...@netflix.com.invalid> wrote:
>
>> Ajantha, Jack and Russell,
>> Thanks for the prompt replies. Just consolidating the information, my
>> understanding is:
>>
>> 1. Spark 2.4 should also have support via the direct action API for
>> compaction (and the action API should be sufficient for me); but the
>> class pointed out
>> <https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>
>> seems to be an abstract class and I could not find an actual
>> implementation in Spark 2.4. Please correct me if I missed something.
>>
>> 2. Action API should be sufficient for my purpose, thanks for pointing
>> out the unit tests showing how it works, but please verify if this is
>> available in Spark 2.4.
>>
>> 3. Currently at Netflix we have a custom solution that compacts very
>> large partitions incrementally via small batches, with the batch size
>> being configured outside of the Spark job doing the actual merge. This
>> gives us more control over the resource consumption of the Spark jobs.
>> Over time we would like to migrate to the Actions API instead, but having
>> the batching be completely controlled internally by the Spark job may not
>> work out. I can look at whether subpartition-level filtering would take
>> care of this issue, but I doubt it will give us the granular control we
>> need. Perhaps an option like the maximum number of files (or bytes) to
>> process would be better.
>>
>> 4. I am not sure if snapshot expiry will by itself automatically garbage
>> collect the unnecessary delete files. For that to happen, I think an
>> explicit DELETE commit of the delete files needs to happen first, by an
>> action that verifies that the delete files are no longer needed in the
>> latest table snapshot. Perhaps there is some work happening to develop
>> such an action? I would love to look at any pending PRs for that effort.
>>
>> Thanks,
>> - Puneet
>>
>> On Mon, Dec 6, 2021 at 9:34 PM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>>> On Dec 6, 2021, at 9:02 PM, Puneet Zaroo <pza...@netflix.com.INVALID> wrote:
>>>
>>> Hi,
>>> I had a few questions related to compaction support, in particular
>>> compaction for CDC destination Iceberg tables. Perhaps this information
>>> is available somewhere else, but I could not find it readily, so
>>> responses are appreciated.
>>>
>>> 1. I believe compaction for the CDC use case will require Iceberg
>>> version >= 0.13 (to pick up the change that maintains the same sequence
>>> numbers after compaction) and Spark version >= 3.0 (for the actual
>>> compaction action support). But please correct me if I'm wrong.
>>>
>>> This isn't strictly necessary, but practically it may be, depending on
>>> your CDC ingestion pace. Spark 2.4 contains an older implementation of
>>> the compaction code which doesn't have the same feature set but can be
>>> used to compact data files.
>>>
>>> 2. How can the compaction action (via Spark) actually be triggered? Is
>>> it possible to specify a filter predicate as well as the size and
>>> delete-file-count thresholds for the compaction strategy via SQL
>>> statements, or does one have to use the XXXRewriteDataFilesSparkAction
>>> classes directly from within a Spark jar?
>>>
>>> There are two methods via Spark (a rough sketch of both is at the
>>> bottom of this mail):
>>>
>>> 1. The Action API — see the examples in the test file here, with all
>>> parameters being set:
>>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
>>>
>>> 2. The SQL API (just included in master):
>>> https://github.com/apache/iceberg/blob/master/spark/v3.2/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRewriteDataFilesProcedure.java
>>>
>>> 3. As far as I could understand from reading the code, the rewrite
>>> action processes all the data that matches a filter predicate (most
>>> likely a partition in practice). Internally the whole matched data set
>>> is broken into smaller chunks which are processed concurrently. Any
>>> thoughts on setting a limit on the amount of work done by the whole
>>> operation? I am worried about really large partitions where, even though
>>> the whole operation is broken into chunks, it will take a long time to
>>> finish.
>>>
>>> Filters are usually the best way to limit the total size of the
>>> operation. Additionally, we have the concept of "partial progress",
>>> which allows the rewrite to commit as it goes rather than all at once at
>>> the end. This means you can terminate a job and still make progress.
>>>
>>> https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java#L33-L48
>>>
>>> 4. The regular compaction will remove the need for equality and
>>> position delete files, but those files will still be around. Is there a
>>> separate compaction action being planned to actually remove the equality
>>> and position delete files?
>>>
>>> This is in progress; please check the dev list archives and Slack for
>>> more information.
>>>
>>> Thanks,
>>> - Puneet
>>>
>>> Most of the delete work is still in progress and we are always looking
>>> for reviewers and developers to help out, so make sure to keep an eye on
>>> GitHub.
>>>
>>> Russ
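>>>
>>> For reference, a typical Spark 3 invocation with the knobs mentioned
>>> above looks roughly like this. This is a sketch only, assuming a recent
>>> master / 0.13 snapshot build; the catalog, table, and partition names
>>> are placeholders:
>>>
>>>     import org.apache.iceberg.Table;
>>>     import org.apache.iceberg.actions.RewriteDataFiles;
>>>     import org.apache.iceberg.expressions.Expressions;
>>>     import org.apache.iceberg.hadoop.HadoopTables;
>>>     import org.apache.iceberg.spark.actions.SparkActions;
>>>     import org.apache.spark.sql.SparkSession;
>>>
>>>     SparkSession spark = SparkSession.builder().getOrCreate();
>>>     Table table = new HadoopTables(spark.sparkContext().hadoopConfiguration())
>>>         .load("hdfs://warehouse/db/tbl");
>>>
>>>     // 1. The Action API: bin-pack one partition, with partial progress
>>>     // enabled so a terminated job keeps the file groups it already
>>>     // committed instead of committing everything at the end.
>>>     RewriteDataFiles.Result result =
>>>         SparkActions.get(spark)
>>>             .rewriteDataFiles(table)
>>>             .binPack()
>>>             .filter(Expressions.equal("date", "2021-12-01"))
>>>             .option(RewriteDataFiles.PARTIAL_PROGRESS_ENABLED, "true")
>>>             .option(RewriteDataFiles.MAX_CONCURRENT_FILE_GROUP_REWRITES, "5")
>>>             .execute();
>>>
>>>     // 2. The SQL API: the equivalent stored procedure call.
>>>     spark.sql(
>>>         "CALL my_catalog.system.rewrite_data_files("
>>>             + "table => 'db.tbl', "
>>>             + "where => 'date = \"2021-12-01\"')");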