[ 
https://issues.apache.org/jira/browse/IMPALA-13087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18038914#comment-18038914
 ] 

Zoltán Borók-Nagy commented on IMPALA-13087:
--------------------------------------------

[https://github.com/apache/iceberg/pull/12861] improves RowDelta so that we can 
add position delete files and remove complete data files in a single 
snapshot.

> DML operations on Iceberg tables should not write position delete files for 
> data files that are completely removed
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-13087
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13087
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Zoltán Borók-Nagy
>            Priority: Minor
>              Labels: impala-iceberg
>
> Users sometimes use the DELETE operation even in cases when they should use 
> TRUNCATE or DROP PARTITION. In those cases the DELETE writes far too many 
> position delete records, which hurts the performance of subsequent queries. On 
> top of that, these delete records are unnecessary, because we should just 
> remove the corresponding data files from the new snapshot.
> We need to make the DML operations smarter so that they only write position 
> delete records if they don't delete whole files. The IcebergBufferedDeleteSink has the 
> FilePositions type: 
> https://github.com/apache/impala/blob/bbfba13ed4d084681b542d7c5e1b5156576a603b/be/src/exec/iceberg-buffered-delete-sink.h#L66
> It is a mapping from data files to the positions we are about to delete. 
> After SortBufferedRecords() the positions are in order and there are no 
> duplicates. Therefore if pos_vector.back() == pos_vector.size() - 1, we 
> know we are about to delete a contiguous range from 0 to N - 1, where N is 
> pos_vector.size(). At this point we need to look up the number of records in 
> the corresponding data file, and if the number of records is N, we know we 
> are about to delete a whole file.
> In this case we shouldn't write the position delete records, but instead 
> register the data file in dml_exec_state_ for deletion.
> Then in the IcebergCatalogOpExecutor we should use Iceberg's DeleteFiles to 
> remove the registered data files in the same Iceberg transaction with the 
> RowDelta operation.
> The DELETE statement is the most critical here, but UPDATE and MERGE might 
> also benefit from this.
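
The whole-file check described above could be sketched roughly as follows. This
is only an illustration, not the actual Impala code: the FilePositions alias is
a simplified stand-in for the type in iceberg-buffered-delete-sink.h, and
DeletesWholeFile(), FilesToRemove() and the record_counts map are hypothetical
names. It assumes positions are non-negative, sorted and free of duplicates, as
they are after SortBufferedRecords().

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified, hypothetical stand-in for the sink's FilePositions type:
// data file path -> row positions that are about to be deleted.
using FilePositions = std::unordered_map<std::string, std::vector<int64_t>>;

// Returns true if 'pos_vector' covers every row of a data file that holds
// 'num_records' rows. Precondition: positions are non-negative, sorted in
// ascending order and contain no duplicates.
bool DeletesWholeFile(const std::vector<int64_t>& pos_vector, int64_t num_records) {
  if (pos_vector.empty()) return false;
  const int64_t n = static_cast<int64_t>(pos_vector.size());
  // With sorted, unique, non-negative positions, back() == n - 1 implies the
  // positions are exactly 0, 1, ..., n - 1.
  return pos_vector.back() == n - 1 && n == num_records;
}

// Collects the data files that can be removed as a whole instead of getting
// position delete records. 'record_counts' (hypothetical) maps each data file
// path to its number of records.
std::vector<std::string> FilesToRemove(const FilePositions& deletes,
    const std::unordered_map<std::string, int64_t>& record_counts) {
  std::vector<std::string> result;
  for (const auto& entry : deletes) {
    auto it = record_counts.find(entry.first);
    // Unknown record count: stay on the safe side and keep position deletes.
    if (it == record_counts.end()) continue;
    if (DeletesWholeFile(entry.second, it->second)) result.push_back(entry.first);
  }
  return result;
}
```

Files returned by such a helper would be the ones to register for deletion
instead of writing position delete records for them; every other file keeps the
existing position delete path.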



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
