Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/21423 )
Change subject: IMPALA-12732: Add support for MERGE statements for Iceberg tables ...................................................................... Patch Set 7: (25 comments) Let me flush another batch of comments now. I'm somewhere halfway through the analyzer changes. http://gerrit.cloudera.org:8080/#/c/21423/6/fe/src/main/cup/sql-parser.cup File fe/src/main/cup/sql-parser.cup: http://gerrit.cloudera.org:8080/#/c/21423/6/fe/src/main/cup/sql-parser.cup@1027 PS6, Line 1027: List<MergeCase> cases = Lists.newArrayList(merge_case); nit: I think here we can merge these 2 rows http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java File fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java: http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@60 PS7, Line 60: public class IcebergMergeImpl extends MergeImpl { As I wrote in MergeStmt, I'm not sure I'm convinced that we need a separate abstraction for the implementation. One reason is that we don't plan to add any other implementations and even if we did, the coding should hppen the other way around: when we add the other Merge impl then as a separate commit we could do the extra level of abstracion then as a refactor. The other thing I find myself uncomfortable with is that this IcebergMergeImpl now uses all the inner members of MergeStmt, even the MergeStmt itself is passed to the constructor. For me this asks for Inheritance rather than an impl member. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@93 PS7, Line 93: if (mergeStmt_.isOnlyMatchedCases()) { : sourceTableRef_.setJoinOp(JoinOperator.INNER_JOIN); : } else { : sourceTableRef_.setJoinOp(JoinOperator.FULL_OUTER_JOIN); : } : sourceTableRef_.setOnClause(on_); : sourceTableRef_.setLeftTblRef(targetTableRef_); This could go into a setJoinParams() function http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@120 PS7, Line 120: icebergPositionalDeleteTable_ = new IcebergPositionDeleteTable( If you have only a WHEN NOT NATCHED THEN INSERT case do you still have to create a pos del table? http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@127 PS7, Line 127: for (MergeCase mergeCase : mergeStmt_.getCases()) { Since all these merge cases live inside MergeStmt, shouldn't this code live in MergeStmt.analyze()? http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@131 PS7, Line 131: analyzer.registerPrivReq( Here we require ALL privs on the target table. Shouldn't we also expect SELECT privs on the source table? http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@138 PS7, Line 138: private void addMergeActionTuple(Analyzer analyzer) { Function comment pls http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@140 PS7, Line 140: analyzer.getDescTbl().createTupleDescriptor(MERGE_ACTION_TUPLE_NAME); What is a merge action and why is a tuple needed for it? Could you comment it on the class? http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@153 PS7, Line 153: public void substituteResultExprs(ExprSubstitutionMap smap, It stinks for me that MergeStmt also has a fn with similar name but here we inherit this not from there but from another interface, MergeImpl. For me this also indicates that this class should be a child of MergeStmt http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@186 PS7, Line 186: public PlanNode getPlanNode(PlannerContext ctx, PlanNode parent, Creating new PlanNodes should be the responsibility of the Planner, in my opinion. An Analysis class in the Analysis package shouldn't know how a corresponding PlanNode is created. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@204 PS7, Line 204: public DataSink createDataSink() { I think this should be an override. Also true to some other functions in this class. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeImpl.java File fe/src/main/java/org/apache/impala/analysis/MergeImpl.java: http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeImpl.java@20 PS7, Line 20: public abstract QueryStmt prepareQuery(); Since this serves as an interface for the Merge implementations (about if we should have such an abstraction see my other comment in MergeStmt) please add a comment for these functions, what the implementations are expected to implement. If we decide to keep this abstraction, I think prepareQuery() shouldn't be part of the interface. The reason I say this is that an interface should be as clean and understandable as possible and for me something like a prepareQuery() violates both. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeImpl.java@24 PS7, Line 24: Analyzer analyzer); nit: indentation http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeImpl.java@32 PS7, Line 32: public TupleDescriptor getMergeActionTuple(){ If we wan't to make this a general abstraction between all other Merge implementations then it's not guaranteed that the other impls will have a MergeActionTuple. Seems to specific to include it in a general interface. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeImpl.java@33 PS7, Line 33: return mergeActionTuple_; this and below could fit into one line http://gerrit.cloudera.org:8080/#/c/21423/6/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java File fe/src/main/java/org/apache/impala/analysis/MergeStmt.java: http://gerrit.cloudera.org:8080/#/c/21423/6/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@38 PS6, Line 38: nit: onClause? http://gerrit.cloudera.org:8080/#/c/21423/6/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@38 PS6, Line 38: impl_; nit: make the name of these 2 variables consistent with each other http://gerrit.cloudera.org:8080/#/c/21423/6/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@78 PS6, Line 78: blRefs.add(sourceTableRef_); > This condition is required to avoid the case where an inline view is provid analyze() already checks for that, right? here it could be a precondition then http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java File fe/src/main/java/org/apache/impala/analysis/MergeStmt.java: http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@32 PS7, Line 32: public class MergeStmt extends DmlStatementBase { I really miss the comments from this class explain the purpose of things. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@38 PS7, Line 38: private MergeImpl impl_; Is this detaching the implementation thing was to follow the architecture of Update? Doesn't have a strong opinion about this but I kind of feel that it is an overkill here sinec we only have an Iceberg implementation and not planning to do others if I'm not mistaken. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@122 PS7, Line 122: return targetTableRef_; Some of these could fit into a single line. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@129 PS7, Line 129: public List<MergeCase> getCases() { nit: add new line above http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@130 PS7, Line 130: return cases_; nit: function fits into single line. http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeStmt.java@133 PS7, Line 133: isOnlyMatchedCases nit: hasMatchedCasesOnly? http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/util/IcebergUtil.java File fe/src/main/java/org/apache/impala/util/IcebergUtil.java: http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/util/IcebergUtil.java@1295 PS7, Line 1295: public static class PartitionExprBundle { I personally don't like this new PartitionExprBundle class. First, the 'Bundle' in the name doesn't mean anything, it could be either PartitionExprStuff, or PartitionExprThings either. If I'm not mistaken this is introduced for a refactor so that populatePartitionExprs() could return a single object instead of having 2 out parameters. Is this right? The original approach was preferrable for me because the caller was forced to provide some params for the desired data, but with the new approach you get a 'Bundle' that you don't know what it contains internally. Not to mention that (in case this is a pure refactor) with a patch of this size including refactors will bring extra noise for the reader. -- To view, visit http://gerrit.cloudera.org:8080/21423 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3416a79740eddc446c87f72bf1a85ed3f71af268 Gerrit-Change-Number: 21423 Gerrit-PatchSet: 7 Gerrit-Owner: Peter Rozsa <[email protected]> Gerrit-Reviewer: Daniel Becker <[email protected]> Gerrit-Reviewer: Gabor Kaszab <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]> Gerrit-Reviewer: Peter Rozsa <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 13 Jun 2024 14:17:08 +0000 Gerrit-HasComments: Yes
