For clarification, Ajantha is correct about 4; I just mean we could remove delete files more eagerly using an additional procedure, but normal snapshot expiration still works. -Jack
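For reference, snapshot expiration can also be invoked as a Spark procedure. A minimal sketch (the catalog name, table name, and retention settings below are placeholders, not values from this thread):

```sql
-- Sketch only: expire old snapshots so data and delete files that are no
-- longer referenced by any retained snapshot become eligible for cleanup.
CALL my_catalog.system.expire_snapshots(
  table => 'db.cdc_table',
  older_than => TIMESTAMP '2021-11-29 00:00:00.000',  -- expire snapshots older than this
  retain_last => 10                                   -- but always keep the last 10
)
```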
On Mon, Dec 6, 2021 at 9:22 PM Jack Ye <yezhao...@gmail.com> wrote:

> 1. Yes, you are correct.
> 2. We just added the SQL procedure call, if you don't want to directly
> invoke the action via Spark:
> https://github.com/apache/iceberg/blob/master/site/docs/spark-procedures.md?plain=1#L243
> 3. The filter is a data filter; it does not need to be at a partition
> boundary, so you can refine the filter based on your compute resources.
> 4. The position and equality delete files are no longer a part of the read
> path, so they should not have much impact on performance, but we do have
> plans to improve our snapshot expiration procedure (likely by adding
> another procedure dedicated to removing deletes) to clean up those files.
>
> -Jack
>
> On Mon, Dec 6, 2021 at 7:06 PM Puneet Zaroo <pza...@netflix.com.invalid> wrote:
>
>> Hi,
>> I had a few questions related to compaction support, in particular
>> compaction for CDC destination Iceberg tables. Perhaps this information is
>> available somewhere else, but I could not find it readily, so responses are
>> appreciated.
>>
>> 1. I believe compaction for the CDC use case will require Iceberg
>> version >= 0.13 (to pick up the change that maintains the same sequence
>> numbers after compaction) and Spark version >= 3.0 (for the actual
>> compaction action support). Please correct me if I'm wrong.
>> 2. How can the compaction action (via Spark) actually be triggered?
>> Is it possible to specify the filter predicate as well as the size and
>> number-of-delete-files thresholds for the compaction strategy via SQL
>> statements, or does one have to use the XXXRewriteDataFilesSparkAction
>> classes directly from within a Spark jar?
>> 3. As far as I could understand from reading the code, the rewrite
>> action processes all the data that matches a filter predicate (most likely
>> a partition in practice). Internally, the whole matched data set is broken
>> into smaller chunks which are processed concurrently. Any thoughts on
>> setting a limit on the amount of work done by the whole operation? I am
>> worried about really large partitions where, even though the whole
>> operation is broken into chunks, it will take a long time to finish.
>> 4. Regular compaction will remove the need for the equality and
>> position delete files, but those files will still be around. Is there a
>> separate compaction action being planned to actually remove the equality
>> and position delete files?
>>
>> Thanks,
>> - Puneet
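For concreteness, a minimal sketch of the rewrite_data_files procedure described in the docs linked above; the catalog name, table name, and filter column are placeholders, and the options shown are the standard rewrite options for shaping concurrency and commit granularity:

```sql
-- Sketch only: compact files matching a data filter. The filter bounds the
-- total work; the options shape how that work is parallelized and committed.
CALL my_catalog.system.rewrite_data_files(
  table => 'db.cdc_table',
  strategy => 'binpack',
  where => 'event_date = "2021-12-06"',          -- data filter, not a partition boundary
  options => map(
    'max-concurrent-file-group-rewrites', '4',   -- file groups rewritten in parallel
    'partial-progress.enabled', 'true'           -- commit groups as they finish
  )
)
```

With partial progress enabled, a long rewrite of a very large partition commits file groups as they complete rather than in one final commit, so an interrupted run still makes durable progress; narrowing the `where` filter remains the main lever for bounding the total amount of work per invocation.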