Zoltan,

The warning is that dynamic overwrites in general aren't recommended.
ReplacePartitions is the right operation to use for dynamic overwrite, we
just want to steer users away from dynamic overwrites in general.

The problem with dynamic overwrite is that its behavior depends on the
underlying physical structure of the table. If your table is partitioned by
days, then it will overwrite by day. If it is partitioned by hours, it will
overwrite by hour. That means that the behavior changes when table
partitioning changes. That's a bad thing because we want SQL operations to
be logical operations that do not depend on the table layout.

OverwriteFiles is a better option because what gets overwritten is
explicit. If you want to overwrite a day, you pass a filter for that day.
Another way around this problem is to support MERGE INTO, which will detect
the files that need to be changed and correctly rewrite them, wherever they
are in the table.

rb

On Fri, Jan 29, 2021 at 10:14 AM Zoltán Borók-Nagy
<borokna...@cloudera.com.invalid> wrote:

> Hey everyone,
>
> I'm currently working on the INSERT OVERWRITE statement for Iceberg tables
> in Impala.
>
> Seems like ReplacePartitions is the perfect interface for this job:
> https://github.infra.cloudera.com/CDH/iceberg/blob/cdpd-master/api/src/main/java/org/apache/iceberg/ReplacePartitions.java
>
> IIUC Spark also uses this interface for dynamic overwrites. Though the
> class comment says that this interface is not recommended to use, and use
> OverwriteFiles instead. OverwriteFiles is more generic and more explicit,
> but for this task it would need extra boilerplate code, therefore more
> possibilities to do stg wrong.
>
> So my question is, is there any problem with using ReplacePartitions for
> dynamic overwrites? I see that it only replaces partitions with the current
> partition spec, but that's probably fine. Otherwise handling partition
> layout evolution and dynamic inserts can be complicated and error-prone
> anyway.
>
> Apart from which interface to use, is there anything I should be aware of?
> E.g. I guess we don't want to allow dynamic overwrites if the table is
> partitioned by the BUCKET transform.
>
> Thanks,
>     Zoltan
>
>

-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to