Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/21423 )
Change subject: IMPALA-12732: Add support for MERGE statements for Iceberg tables ...................................................................... Patch Set 7: (25 comments) I managed to finish the first iteration on the FE changes plus the tests. Thanks for this huge review! I'll continue with the BE changes. http://gerrit.cloudera.org:8080/#/c/21423/7/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/21423/7/common/thrift/PlanNodes.thrift@738 PS7, Line 738: 2: optional list<Exprs.TExpr> filter_conjuncts Just thinking out load here: Can't we represent the merge-case as a filter conjunct here? e.g. WHEN MATCHED could be 'row_present == 0'. This could be constructed in analysis time if I don't miss anything, and then the 'filter_conjuncts' would be required here. And then we might be able to remove the TMergeCaseType too from this struct. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java File fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java: http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@192 PS7, Line 192: List<Expr> positionMetaExprs = Expr.cloneList(getPositionMetaExprs()); I'm not sure I get why we have to clone the cases and expressions here. Can't we simply pass them as they are? I probably miss something, could you explain? http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@195 PS7, Line 195: parent, I find it confusing that a variable names 'parent' is passed to this constructor as a parameter named 'child' on the other side. I might miss something here, though. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@204 PS7, Line 204: createDataSink similarly to getPlanNode, I don't think creating stuff related to the planner, like data sinks, should be placed into analyzer classes. Update, I see this is needed because MergeStmt has to implement it as an override from DmlStatementBase. Well, I don't like it but then let's keep it :) http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@215 PS7, Line 215: TableSink deleteSink = Not all use cases require an insert/delete sink. Is it necessary to create both of them undoncitionally? http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java File fe/src/main/java/org/apache/impala/analysis/MergeInsert.java: http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@117 PS7, Line 117: analyzedColumnPermutation I think the name suggests the opposite than what this is for. 'columnsToBeAnalyzed' or something similar would explain the purpose better. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@119 PS7, Line 119: Collectors.toMap(s -> s, o -> columnPermutation_.indexOf(o))); For me it seems an overkill to pre-collect the indexes for the column names. Doesn't seem to increase performance but adds extra code complexity. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@122 PS7, Line 122: boolean emptyColumnPermutation = Not sure about this so just asking: For the case of no column permutations provided wouldn't the code be more clear if we did this: - do a check if column permutations are provided - uncoditionally go through column of the table and populate 'matches' - return and then do the rest without the IMPLICIT case below. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@138 PS7, Line 138: null Not sure why you don't use the name of the col for the IMPLICIT case. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@143 PS7, Line 143: null)); nit: this could fit into line above http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@183 PS7, Line 183: if (columnPermutationSize > selectListSize) { For me logically the length check could more naturally fit right after the duplicate check. Here you use the size of 'columnMatches' for the IMPLICIT case, however you could use the number of cols instead if you moved this code above the population of 'columnMatches'. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@186 PS7, Line 186: target, below you put these 3 params into a single line, but here you break them into separate lines. Pls keep it consistent. I'd prefer the more compact version. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@198 PS7, Line 198: for (ColumnMatch match : columnMatches.list) { I think this population of field could be merged into the for loop on columns at L127 if we did the column/permutation/select list length checks before that. If my assumption stands true, than we can drop the 'listingOrder' col from Matches because we could use this information right away and won't further need to store it. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@223 PS7, Line 223: } When I got to the end of this collate function, I find that since some of the code could be moved and merged with code in analyzeColumnPermutation, is there anything left in this function that justifies its existence. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@230 PS7, Line 230: listingOrder This is not an order but an index http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@231 PS7, Line 231: private Expr sourceExpr; I think we can make this final too and set it when all the other members via the constructor. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@264 PS7, Line 264: IMPLICITLY_MATCHED IMPLICITLY_MATCHED doesn't seem to fit this design for me. Either all of the matches are IMPLICIT or none of them are, because it depends on if the user has provided a column permutation or not. I'd rather move that out from MatchType and store a bool at 'ColumnMatches' level. I'd also give it some thoughts if it can be entirely dropped because it seems to be used only to fine-tune some error messages. Once that is removed, then MatchType could also be replaced by a bool 'isMatched'_. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java File fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java: http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@37 PS7, Line 37: public class IcebergMergeNode extends PlanNode { comments pls http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@57 PS7, Line 57: addChild(child); Can we assume that the child node is always some kind of Join? If yes, can we do a Precondition on this? http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@64 PS7, Line 64: cases_.stream() nit: this line could be merged with the one above and this saves us one level of indentation too. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@94 PS7, Line 94: analyzer, true); nit: fits into line above. Also see L97 http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@107 PS7, Line 107: tupleIds_.add(mergeActionTuple_.getId()); ni: move this closer to the part where we put other stuff into 'tupleIds'. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeSink.java File fe/src/main/java/org/apache/impala/planner/IcebergMergeSink.java: http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeSink.java@27 PS7, Line 27: extends DataSink Isn't this a kind of MultiDataSink? http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeSink.java@27 PS7, Line 27: public class IcebergMergeSink extends DataSink { comments pls http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/Planner.java File fe/src/main/java/org/apache/impala/planner/Planner.java: http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/Planner.java@1032 PS7, Line 1032: return merge.getPlanNode(ctx_, planNode, analyzer); As commented on MergeStmt, an analyzer class shouldn't be responsible of creating a plan node. It's the responsibility of the planner. -- To view, visit http://gerrit.cloudera.org:8080/21423 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3416a79740eddc446c87f72bf1a85ed3f71af268 Gerrit-Change-Number: 21423 Gerrit-PatchSet: 7 Gerrit-Owner: Peter Rozsa <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Tue, 18 Jun 2024 08:50:09 +0000 Gerrit-HasComments: Yes
