Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/21423 )
Change subject: IMPALA-12732: Add support for MERGE statements for Iceberg tables ...................................................................... Patch Set 11: (20 comments) A concurrent test and some larger scale would be nice. Otherwise the change looks great! http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeCase.java File fe/src/main/java/org/apache/impala/analysis/MergeCase.java: http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeCase.java@89 PS11, Line 89: protected SlotRef createSlotRef(Analyzer analyzer, String colName) : throws AnalysisException { : List<String> path = Path.createRawPath(targetTableRef_.getUniqueAlias(), colName); : SlotRef ref = new SlotRef(path); : ref.analyze(analyzer); : return ref; : } We also have such a method in IcebergModifyImpl. Can we put it to some common place? http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeCase.java@115 PS11, Line 115: return (MergeCase) super.clone(); StatementBase.clone() throws NotImplementedException. I think it'd be cleaner to throw it here as well or provide a proper implementation that copies the members. We usually implement this with the help of a copy ctor, see e.g. InsertStmt. http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeDelete.java File fe/src/main/java/org/apache/impala/analysis/MergeDelete.java: http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeDelete.java@49 PS11, Line 49: for (Column col : targetTableColumns_) { : resultExprs_.add(createSlotRef(analyzer, col.getName())); : } Could you please add a comment why we need all the target table columns, and not only the positition delete columns? http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java File fe/src/main/java/org/apache/impala/analysis/MergeInsert.java: http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@125 PS11, Line 125: // Comparing by reference : boolean emptyColumnPermutation = columnPermutation_ == Collections.EMPTY_LIST; Why do you compare by reference? columnPermutation_.isEmpty() feels safer and not too inefficient either. http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@188 PS11, Line 188: columnNameFrequency It doesn't store the frequency as it is not a map. So how about just 'columnNames'? http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@188 PS11, Line 188: // name Unnecessary comment: // name http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java File fe/src/main/java/org/apache/impala/analysis/MergeStmt.java: http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@34 PS11, Line 34: / nit: you could add a space before this http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@36 PS11, Line 36: additional filter expression nit: "an additional filter expression" / "additional filter expressions" http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@69 PS11, Line 69: nit: 4 spaces are enough I think. That will be also consistent with L83 http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@80 PS11, Line 80: setMaxTableSinks(analyzer_.getQueryOptions().getMax_fs_writers()); Does it work? (I know it doesn't work for UPDATE e.g.) If it works, we should have tests for this. If it doesn't, let's just create Jira to track this. http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@101 PS11, Line 101: super.collectTableRefs(tblRefs); nit: I think it's good practice to call super methods first, this way we can guarantee they are being invoked (e.g. imagine you add a return statement later to the method body) Probably the only exceptions to this rule are the close()/free() and such methods where we need to free the derived class's resources first. http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeUpdate.java File fe/src/main/java/org/apache/impala/analysis/MergeUpdate.java: http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/analysis/MergeUpdate.java@98 PS11, Line 98: private SlotRef disambiguateLhs(SlotRef lhs) throws AnalysisException { Please add comment to this method http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java File fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java: http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/planner/DistributedPlanner.java@160 PS11, Line 160: { nit: missing space http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java File fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java: http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@121 PS11, Line 121: if (getChild(0).cardinality_ == -1) { : cardinality_ = -1; : } else { : cardinality_ = applyConjunctsSelectivity(getChild(0).cardinality_); : Preconditions.checkState(cardinality_ >= 0); : } If we have both MATCHED and NOT MATCHED claues without any filters we can assume that the cardinality is the same as the child's cardinality, right? http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/service/Frontend.java File fe/src/main/java/org/apache/impala/service/Frontend.java: http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/service/Frontend.java@2726 PS11, Line 2726: { nit: missing space http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java File fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java: http://gerrit.cloudera.org:8080/#/c/21423/11/fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java@432 PS11, Line 432: List<ByteBuffer> dataFilesFb = icebergOp.getIceberg_data_files_fb(); : List<ByteBuffer> deleteFilesFb = icebergOp.getIceberg_delete_files_fb(); : if (deleteFilesFb.isEmpty() && !dataFilesFb.isEmpty()) { : appendFiles(feIcebergTable, txn, icebergOp); : return; : } : if (!deleteFilesFb.isEmpty() && dataFilesFb.isEmpty()) { : deleteRows(feIcebergTable, txn, icebergOp); : return; : } Because of the validation checks I think it's safer to always invoke updateRows() which always uses RowDelta. A plain INSERT just adds new data and doesn't care about what's in the table. But a MERGE statement can behave differently based on the contents of the table, so concurrent modifications can be problematic. E.g. two concurrent MERGE statements with MATCHED and NOT MATCHED clauses. Concurrent tests under tests/stress would be nice, see e.g. tests/stress/test_udpate_stress.py http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-merge.test File testdata/workloads/functional-planner/queries/PlannerTest/iceberg-merge.test: http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-merge.test@219 PS11, Line 219: target.action > We discussed it offline: we shouldn't shuffle the rows based on 'target.act My mistake, target.action is a column of the table that is included in the partition spec, it is not the merge action. http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-merge.test@317 PS11, Line 317: functional_parquet.iceberg_non_partitioned source Could you please add tests where SOURCE is an inline view? http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-planner/queries/PlannerTest/iceberg-merge.test@418 PS11, Line 418: ==== Could you please add tests for tables with partition evolution, e.g. * iceberg_partition_evolution * iceberg_v2_delete_equality_partition_evolution http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-query/queries/QueryTest/iceberg-merge.test File testdata/workloads/functional-query/queries/QueryTest/iceberg-merge.test: http://gerrit.cloudera.org:8080/#/c/21423/11/testdata/workloads/functional-query/queries/QueryTest/iceberg-merge.test@11 PS11, Line 11: stored as iceberg tblproperties("format-version"="2"); You could also modify the partition layout of these tables at some point. -- To view, visit http://gerrit.cloudera.org:8080/21423 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3416a79740eddc446c87f72bf1a85ed3f71af268 Gerrit-Change-Number: 21423 Gerrit-PatchSet: 11 Gerrit-Owner: Peter Rozsa <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Fri, 05 Jul 2024 15:45:50 +0000 Gerrit-HasComments: Yes
