[Impala-ASF-CR] IMPALA-12732: Add support for MERGE statements for Iceberg tables

Gabor Kaszab (Code Review) Tue, 18 Jun 2024 02:15:08 -0700

Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21423 )


Change subject: IMPALA-12732: Add support for MERGE statements for Iceberg 
tables
......................................................................


Patch Set 7:

(25 comments)

I managed to finish the first iteration on the FE changes plus the tests. 
Thanks for this huge review! I'll continue with the BE changes.

http://gerrit.cloudera.org:8080/#/c/21423/7/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/21423/7/common/thrift/PlanNodes.thrift@738
PS7, Line 738:   2: optional list<Exprs.TExpr> filter_conjuncts
Just thinking out load here:
Can't we represent the merge-case as a filter conjunct here? e.g. WHEN MATCHED 
could be 'row_present == 0'. This could be constructed in analysis time if I 
don't miss anything, and then the 'filter_conjuncts' would be required here. 
And then we might be able to remove the TMergeCaseType too from this struct.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java
File fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java:

http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@192
PS7, Line 192:     List<Expr> positionMetaExprs = 
Expr.cloneList(getPositionMetaExprs());
I'm not sure I get why we have to clone the cases and expressions here. Can't 
we simply pass them as they are? I probably miss something, could you explain?


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@195
PS7, Line 195:         parent,
I find it confusing that a variable names 'parent' is passed to this 
constructor as a parameter named 'child' on the other side. I might miss 
something here, though.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@204
PS7, Line 204: createDataSink
similarly to getPlanNode, I don't think creating stuff related to the planner, 
like data sinks, should be placed into analyzer classes.

Update, I see this is needed because MergeStmt has to implement it as an 
override from DmlStatementBase. Well, I don't like it but then let's keep it :)


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/IcebergMergeImpl.java@215
PS7, Line 215:     TableSink deleteSink =
Not all use cases require an insert/delete sink. Is it necessary to create both 
of them undoncitionally?


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java
File fe/src/main/java/org/apache/impala/analysis/MergeInsert.java:

http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@117
PS7, Line 117: analyzedColumnPermutation
I think the name suggests the opposite than what this is for. 
'columnsToBeAnalyzed' or something similar would explain the purpose better.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@119
PS7, Line 119:             Collectors.toMap(s -> s, o -> 
columnPermutation_.indexOf(o)));
For me it seems an overkill to pre-collect the indexes for the  column names. 
Doesn't seem to increase performance but adds extra code complexity.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@122
PS7, Line 122:     boolean emptyColumnPermutation =
Not sure about this so just asking: For the case of no column permutations 
provided wouldn't the code be more clear if we did this:
 - do a check if column permutations are provided
 - uncoditionally go through column of the table and populate 'matches'
 - return

and then do the rest without the IMPLICIT case below.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@138
PS7, Line 138: null
Not sure why you don't use the name of the col for the IMPLICIT case.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@143
PS7, Line 143:                 null));
nit: this could fit into line above


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@183
PS7, Line 183:     if (columnPermutationSize > selectListSize) {
For me logically the length check could more naturally fit right after the 
duplicate check. Here you use the size of 'columnMatches' for the IMPLICIT 
case, however you could use the number of cols instead if you moved this code 
above the population of 'columnMatches'.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@186
PS7, Line 186:           target,
below you put these 3 params into a single line, but here you break them into 
separate lines. Pls keep it consistent. I'd prefer the more compact version.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@198
PS7, Line 198:     for (ColumnMatch match : columnMatches.list) {
I think this population of field could be merged into the for loop on columns 
at L127 if we did the column/permutation/select list length checks before that.

If my assumption stands true, than we can drop the 'listingOrder' col from 
Matches because we could use this information right away and won't further need 
to store it.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@223
PS7, Line 223:   }
When I got to the end of this collate function, I find that since some of the 
code could be moved and merged with code in analyzeColumnPermutation, is there 
anything left in this function that justifies its existence.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@230
PS7, Line 230: listingOrder
This is not an order but an index


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@231
PS7, Line 231:     private Expr sourceExpr;
I think we can make this final too and set it when all the other members via 
the constructor.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/analysis/MergeInsert.java@264
PS7, Line 264: IMPLICITLY_MATCHED
IMPLICITLY_MATCHED doesn't seem to fit this design for me. Either all of the 
matches are IMPLICIT or none of them are, because it depends on if the user has 
provided a column permutation or not. I'd rather move that out from MatchType 
and store a bool at 'ColumnMatches' level. I'd also give it some thoughts if it 
can be entirely dropped because it seems to be used only to fine-tune some 
error messages.

Once that is removed, then MatchType could also be replaced by a bool 
'isMatched'_.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java
File fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java:

http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@37
PS7, Line 37: public class IcebergMergeNode extends PlanNode {
comments pls


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@57
PS7, Line 57:     addChild(child);
Can we assume that the child node is always some kind of Join? If yes, can we 
do a Precondition on this?


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@64
PS7, Line 64:         cases_.stream()
nit: this line could be merged with the one above and this saves us one level 
of indentation too.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@94
PS7, Line 94:         analyzer, true);
nit: fits into line above.
Also see L97


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeNode.java@107
PS7, Line 107:     tupleIds_.add(mergeActionTuple_.getId());
ni: move this closer to the part where we put other stuff into 'tupleIds'.


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeSink.java
File fe/src/main/java/org/apache/impala/planner/IcebergMergeSink.java:

http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeSink.java@27
PS7, Line 27: extends DataSink
Isn't this a kind of MultiDataSink?


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/IcebergMergeSink.java@27
PS7, Line 27: public class IcebergMergeSink extends DataSink {
comments pls


http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

http://gerrit.cloudera.org:8080/#/c/21423/7/fe/src/main/java/org/apache/impala/planner/Planner.java@1032
PS7, Line 1032:     return merge.getPlanNode(ctx_, planNode, analyzer);
As commented on MergeStmt, an analyzer class shouldn't be responsible of 
creating a plan node. It's the responsibility of the planner.



--
To view, visit http://gerrit.cloudera.org:8080/21423
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3416a79740eddc446c87f72bf1a85ed3f71af268
Gerrit-Change-Number: 21423
Gerrit-PatchSet: 7
Gerrit-Owner: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Noemi Pap-Takacs <[email protected]>
Gerrit-Reviewer: Peter Rozsa <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Tue, 18 Jun 2024 08:50:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-12732: Add support for MERGE statements for Iceberg tables

Reply via email to