[
https://issues.apache.org/jira/browse/IMPALA-13087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18038914#comment-18038914
]
Zoltán Borók-Nagy commented on IMPALA-13087:
--------------------------------------------
[https://github.com/apache/iceberg/pull/12861] improves RowDelta so that we can
implement adding position delete files and removing complete data files in a
single snapshot.
> DML operations on Iceberg tables should not write position delete files for
> data files that are completely removed
> --------------------------------------------------------------------------------------------------------------------
>
> Key: IMPALA-13087
> URL: https://issues.apache.org/jira/browse/IMPALA-13087
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Zoltán Borók-Nagy
> Priority: Minor
> Labels: impala-iceberg
>
> Users sometimes use the DELETE operation even in cases when they should use
> TRUNCATE or DROP PARTITION. In those cases the DELETE will write way too many
> position delete records that will hurt performance of subsequent queries. On
> top of that, these delete records are unnecessary, because we should just
> remove the corresponding data files from the new snapshot.
> We need to make the DML operations smarter so that they only write position
> delete records if they don't delete whole files. The IcebergBufferedDeleteSink
> has the FilePositions type:
> https://github.com/apache/impala/blob/bbfba13ed4d084681b542d7c5e1b5156576a603b/be/src/exec/iceberg-buffered-delete-sink.h#L66
> It is a mapping from data files to the positions we are about to delete.
> After SortBufferedRecords() the positions are in order and there are no
> duplicates. Therefore, if pos_vector.back() == pos_vector.size() - 1, we
> know we are about to delete a contiguous range from 0 to N - 1, where
> N = pos_vector.size(). At this point we need to look up the number of records
> in the corresponding data file, and if the number of records is N, we know we
> are about to delete a whole file.
> In this case we shouldn't write the position delete records, but instead
> register the data file in dml_exec_state_ for deletion.
> Then in the IcebergCatalogOpExecutor we should use Iceberg's DeleteFiles to
> remove the registered data files in the same Iceberg transaction with the
> RowDelta operation.
> The DELETE statement is the most critical here, but UPDATE and MERGE might
> also benefit from this.
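
The whole-file check described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual Impala code: the FilePositions alias below is a simplified stand-in for the real type, and DeletesWholeFile, PlanDeletes, DeletePlan and the row_counts lookup are hypothetical names. How the record count of a data file is actually obtained is left open here.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in (assumption) for the FilePositions mapping:
// data file path -> row positions that are about to be deleted.
using FilePositions = std::unordered_map<std::string, std::vector<int64_t>>;

// Hypothetical helper. Expects positions that are sorted, duplicate-free and
// non-negative, as they are after SortBufferedRecords().
// Returns true if they cover every row of a file with num_rows_in_file rows.
bool DeletesWholeFile(const std::vector<int64_t>& sorted_positions,
                      int64_t num_rows_in_file) {
  if (sorted_positions.empty()) return false;
  const int64_t n = static_cast<int64_t>(sorted_positions.size());
  // n distinct non-negative values whose maximum is n - 1 must be 0..n-1.
  return sorted_positions.back() == n - 1 && n == num_rows_in_file;
}

// Hypothetical result type: files to drop entirely, and files that still
// need position delete records.
struct DeletePlan {
  std::vector<std::string> files_to_remove;
  FilePositions position_deletes;
};

// Splits the buffered deletes into whole-file removals and position deletes.
// row_counts (assumption) maps each data file path to its record count.
DeletePlan PlanDeletes(const FilePositions& deletes,
                       const std::unordered_map<std::string, int64_t>& row_counts) {
  DeletePlan plan;
  for (const auto& entry : deletes) {
    const auto it = row_counts.find(entry.first);
    if (it != row_counts.end() && DeletesWholeFile(entry.second, it->second)) {
      plan.files_to_remove.push_back(entry.first);
    } else {
      // Unknown row count or partial delete: keep the position deletes.
      plan.position_deletes.emplace(entry.first, entry.second);
    }
  }
  return plan;
}
```

In this sketch a file with an unknown record count falls back to position deletes, which keeps the result correct even when the lookup fails.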