Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21423 )

Change subject: IMPALA-12732: Add support for MERGE statements for Iceberg 
tables
......................................................................


Patch Set 11:

(20 comments)

A concurrent test and some larger scale would be nice. Otherwise the change 
looks great!

http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeCase.java
File fe/src/main/java/org/apache/impala/analysis/MergeCase.java:

http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeCase.java@89
PS11, Line 89:   protected SlotRef createSlotRef(Analyzer analyzer, String 
colName)
             :       throws AnalysisException {
             :     List<String> path = 
Path.createRawPath(targetTableRef_.getUniqueAlias(), colName);
             :     SlotRef ref = new SlotRef(path);
             :     ref.analyze(analyzer);
             :     return ref;
             :   }
We also have such a method in IcebergModifyImpl. Can we put it to some common 
place?


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeCase.java@115
PS11, Line 115: return (MergeCase) super.clone();
StatementBase.clone() throws NotImplementedException. I think it'd be cleaner 
to throw it here as well or provide a proper implementation that copies the 
members. We usually implement this with the help of a copy ctor, see e.g. 
InsertStmt.


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeDelete.java
File fe/src/main/java/org/apache/impala/analysis/MergeDelete.java:

http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeDelete.java@49
PS11, Line 49:     for (Column col : targetTableColumns_) {
             :       resultExprs_.add(createSlotRef(analyzer, col.getName()));
             :     }
Could you please add a comment why we need all the target table columns, and 
not only the positition delete columns?


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java
File fe/src/main/java/org/apache/impala/analysis/MergeInsert.java:

http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@125
PS11, Line 125:     // Comparing by reference
              :     boolean emptyColumnPermutation = columnPermutation_ == 
Collections.EMPTY_LIST;
Why do you compare by reference? columnPermutation_.isEmpty() feels safer and 
not too inefficient either.


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@188
PS11, Line 188: columnNameFrequency
It doesn't store the frequency as it is not a map. So how about just 
'columnNames'?


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@188
PS11, Line 188: // name
Unnecessary comment: // name


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java
File fe/src/main/java/org/apache/impala/analysis/MergeStmt.java:

http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@34
PS11, Line 34: /
nit: you could add a space before this


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@36
PS11, Line 36: additional filter expression
nit: "an additional filter expression" / "additional filter expressions"


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@69
PS11, Line 69:   
nit: 4 spaces are enough I think. That will be also consistent with L83


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@80
PS11, Line 80: 
setMaxTableSinks(analyzer_.getQueryOptions().getMax_fs_writers());
Does it work? (I know it doesn't work for UPDATE e.g.)
If it works, we should have tests for this. If it doesn't, let's just create 
Jira to track this.


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@101
PS11, Line 101: super.collectTableRefs(tblRefs);
nit: I think it's good practice to call super methods first, this way we can 
guarantee they are being invoked (e.g. imagine you add a return statement later 
to the method body)

Probably the only exceptions to this rule are the close()/free() and such 
methods where we need to free the derived class's resources first.


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeUpdate.java
File fe/src/main/java/org/apache/impala/analysis/MergeUpdate.java:

http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeUpdate.java@98
PS11, Line 98:   private SlotRef disambiguateLhs(SlotRef lhs) throws 
AnalysisException {
Please add comment to this method


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java
File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java:

http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@160
PS11, Line 160: {
nit: missing space


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java:

http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@121
PS11, Line 121:     if (getChild(0).cardinality_ == -1) {
              :       cardinality_ = -1;
              :     } else {
              :       cardinality_ = 
applyConjunctsSelectivity(getChild(0).cardinality_);
              :       Preconditions.checkState(cardinality_ >= 0);
              :     }
If we have both MATCHED and NOT MATCHED claues without any filters we can 
assume that the cardinality is the same as the child's cardinality, right?


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/service/Frontend.java
File fe/src/main/java/org/apache/impala/service/Frontend.java:

http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/service/Frontend.java@2726
PS11, Line 2726: {
nit: missing space


http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
File fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java:

http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java@432
PS11, Line 432:     List<ByteBuffer> dataFilesFb = 
icebergOp.getIceberg_data_files_fb();
              :     List<ByteBuffer> deleteFilesFb = 
icebergOp.getIceberg_delete_files_fb();
              :     if (deleteFilesFb.isEmpty() && !dataFilesFb.isEmpty()) {
              :       appendFiles(feIcebergTable, txn, icebergOp);
              :       return;
              :     }
              :     if (!deleteFilesFb.isEmpty() && dataFilesFb.isEmpty()) {
              :       deleteRows(feIcebergTable, txn, icebergOp);
              :       return;
              :     }
Because of the validation checks I think it's safer to always invoke 
updateRows() which always uses RowDelta.

A plain INSERT just adds new data and doesn't care about what's in the table. 
But a MERGE statement can behave differently based on the contents of the 
table, so concurrent modifications can be problematic. E.g. two concurrent 
MERGE statements with MATCHED and NOT MATCHED clauses.

Concurrent tests under tests/stress would be nice, see e.g. 
tests/stress/test_udpate_stress.py


http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-merge.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/iceberg-merge.test:

http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-merge.test@219
PS11, Line 219: target.action
> We discussed it offline: we shouldn't shuffle the rows based on 'target.act
My mistake, target.action is a column of the table that is included in the 
partition spec, it is not the merge action.


http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-merge.test@317
PS11, Line 317: functional_parquet.iceberg_non_partitioned source
Could you please add tests where SOURCE is an inline view?


http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-merge.test@418
PS11, Line 418: ====
Could you please add tests for tables with partition evolution, e.g.

* iceberg_partition_evolution
* iceberg_v2_delete_equality_partition_evolution


http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-query/queries/QueryTest/iceberg-merge.test
File testdata/workloads/functional-query/queries/QueryTest/iceberg-merge.test:

http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-query/queries/QueryTest/iceberg-merge.test@11
PS11, Line 11:   stored as iceberg tblproperties("format-version"="2");
You could also modify the partition layout of these tables at some point.



--
To view, visit http://gerrit.cloudera.org:8080/21423
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3416a79740eddc446c87f72bf1a85ed3f71af268
Gerrit-Change-Number: 21423
Gerrit-PatchSet: 11
Gerrit-Owner: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 05 Jul 2024 15:45:50 +0000
Gerrit-HasComments: Yes

Reply via email to