Re: Some questions related to compaction support.

Ajantha Bhat Mon, 06 Dec 2021 21:22:55 -0800

>
>
>    1. I believe compaction for the CDC use case will require iceberg
>    version >= 0.13 (to pick up the change that maintains the same sequence
>    numbers after compaction) and Spark version >= 3.0 (for the actual
>    compaction action support). But please correct me if I'm wrong.
>
*Yes. PR#3480 <https://github.com/apache/iceberg/pull/3480> addresses the CDC
use case, so it will be available from version 0.13 onward. Spark 2.4 also
supports the rewrite data files Spark action
<https://github.com/apache/iceberg/blob/master/spark/v2.4/spark/src/main/java/org/apache/iceberg/spark/actions/BaseRewriteDataFilesSparkAction.java>.*

> 2. How can the compaction action (via Spark) actually be triggered? Is it
> possible to  specify filter predicate as well as the size and number of
> delete file thresholds for the compaction strategy via SQL statements or
> does one have to use the XXXRewriteDataFilesSparkAction classes directly
> from within a spark jar.

*For the 0.13 release, I recently added SQL support for the
rewrite_data_files call procedure in PR#3375
<https://github.com/apache/iceberg/pull/3375> for Spark 3.2. Backporting
to Spark 3.1 and Spark 3.0 is also in progress. However, Spark 2.4 doesn't
support call procedures at all, so this SQL support won't be available
for Spark 2.4.*
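
As a rough illustration of triggering compaction through SQL, the sketch below only builds the CALL statement as a string. The catalog name `my_catalog`, the table `db.events`, and the filter are hypothetical, and the `table` and `where` named arguments are assumed to match the procedure as documented for Iceberg's Spark 3.x integration. Actually running the statement needs a Spark session configured with an Iceberg catalog and the Iceberg SQL extensions, which is not shown here.

```python
from typing import Optional


def rewrite_data_files_sql(catalog: str, table: str, where: Optional[str] = None) -> str:
    """Build a CALL statement for the rewrite_data_files procedure."""
    args = [f"table => '{table}'"]
    if where is not None:
        # The predicate is wrapped in single quotes, so it must not contain any.
        # String values inside the predicate can use double quotes instead.
        if "'" in where:
            raise ValueError("predicate must not contain single quotes")
        args.append(f"where => '{where}'")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


# Hypothetical names; with a configured session this would be spark.sql(stmt).
stmt = rewrite_data_files_sql("my_catalog", "db.events", where='event_date = "2021-12-01"')
print(stmt)
```

On Spark 2.4, where call procedures are not available, the same compaction would have to go through the action classes instead.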

> 3. As far as I could understand from reading the code, the rewrite action
> processes all the data that matches a filter predicate (most likely a
> partition in practice). Internally the whole matched data is broken into
> smaller chunks which are processed concurrently. Any thoughts on setting a
> limit on the amount of work being done by the whole operation. I am worried
> about really large partitions where even though the whole operation is
> broken into chunks; it will take a long time to finish.

*A file group is the smallest unit of work here, and I think we have a
lot of control over how compaction proceeds within a partition. We have
the "partial-progress.enabled", "max-file-group-size-bytes", and
"max-concurrent-file-group-rewrites" options; see
<https://github.com/apache/iceberg/blob/master/api/src/main/java/org/apache/iceberg/actions/RewriteDataFiles.java>.
But a good document with scenario examples would definitely help to
understand these better. We need to add it.*
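
To make the three options above a little more concrete, here is a minimal sketch of passing them to the rewrite_data_files procedure through its `options` argument. The option names come from the reply; the values (10 GiB file groups, 4 concurrent rewrites), the catalog `my_catalog`, and the table `db.events` are illustrative assumptions, not recommendations. The code only assembles the SQL text.

```python
def options_map_sql(options: dict) -> str:
    """Render options as a Spark SQL map('k1', 'v1', ...) expression."""
    parts = []
    for key, value in options.items():
        parts.append(f"'{key}', '{value}'")
    return "map(" + ", ".join(parts) + ")"


# Option names are the ones mentioned above; the values are illustrative only.
options = {
    "partial-progress.enabled": "true",
    "max-file-group-size-bytes": str(10 * 1024 ** 3),  # 10 GiB per file group
    "max-concurrent-file-group-rewrites": "4",
}

# Hypothetical catalog and table names.
stmt = (
    "CALL my_catalog.system.rewrite_data_files("
    "table => 'db.events', "
    f"options => {options_map_sql(options)})"
)
print(stmt)
```

With partial progress enabled, completed file groups can be committed before the whole rewrite finishes, which is the property that matters for the very large partitions raised in the question.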

> 4. The regular compaction will remove the need for equality and position
> delete files, but those files will still be around. Is there a separate
> compaction action being planned to actually remove the equality and
> position delete files?

*I think the "ExpireSnapshots" Spark action or call procedure can be used to
remove the delete files that are redundant after compaction. It is
similar to how we clean up the old data files that have been compacted.*
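
A possible shape for the cleanup step, sketched as a helper that builds an expire_snapshots CALL statement. The catalog, table, cutoff date, and `retain_last` value are hypothetical, and the named arguments (`table`, `older_than`, `retain_last`) are assumed from the procedure as documented for Spark 3.x. Whether a given delete file is actually removed depends on it no longer being referenced by any retained snapshot.

```python
from datetime import datetime


def expire_snapshots_sql(catalog: str, table: str, older_than: datetime, retain_last: int = 1) -> str:
    """Build a CALL statement for the expire_snapshots procedure."""
    # Format the cutoff as a Spark SQL TIMESTAMP literal.
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S.000")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{ts}', "
        f"retain_last => {retain_last})"
    )


# Hypothetical names and cutoff; with a configured session this would be spark.sql(stmt).
stmt = expire_snapshots_sql("my_catalog", "db.events", datetime(2021, 12, 1), retain_last=5)
print(stmt)
```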

Thanks,
Ajantha

On Tue, Dec 7, 2021 at 8:33 AM Puneet Zaroo <pza...@netflix.com.invalid>
wrote:

> Hi,
> I had a few questions related to compaction support, in particular
> compaction for CDC destination iceberg tables. Perhaps this information is
> available somewhere else, but I could not find it readily, so responses
> appreciated.
>
>    1. I believe compaction for the CDC use case will require iceberg
>    version >= 0.13 (to pick up the change that maintains the same sequence
>    numbers after compaction) and Spark version >= 3.0 (for the actual
>    compaction action support). But please correct me if I'm wrong.
>    2. How can the compaction action (via Spark) actually be triggered? Is
>    it possible to  specify filter predicate as well as the size and number of
>    delete file thresholds for the compaction strategy via SQL statements or
>    does one have to use the XXXRewriteDataFilesSparkAction classes directly
>    from within a spark jar.
>    3. As far as I could understand from reading the code, the rewrite
>    action processes all the data that matches a filter predicate (most likely
>    a partition in practice). Internally the whole matched data is broken into
>    smaller chunks which are processed concurrently. Any thoughts on setting a
>    limit on the amount of work being done by the whole operation. I am worried
>    about really large partitions where even though the whole operation is
>    broken into chunks; it will take a long time to finish.
>    4. The regular compaction will remove the need for equality and
>    position delete files, but those files will still be around. Is there a
>    separate compaction action being planned to actually remove the equality
>    and position delete files?
>
> Thanks,
> - Puneet
>
>
