Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21239 )

Change subject: IMPALA-13043: Implement Join Capability to the Calcite Planner
......................................................................


Patch Set 16:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/21239/16//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21239/16//COMMIT_MSG@28
PS16, Line 28: There is some special case logic in the join that requires the 
execution
             : engine to run on a single node. The original logic can be found 
in the
             : JoinNode planner object, but this code isn't executed. The new 
mechanism
             : for checking single node executions is passed back to the root 
via the
             : NodeWithExprs object, and is checked in the 
ImpalaJoinRel.useSingleNode
             : method.
> Ok...I had to reexamine this. I didn't code this part the first time and pu
Impala's current planner will throw an error for some SQL statements that can't 
run in a distributed way. For example, a nested loop join with a FULL OUTER 
join will fail at that location in SingleNodePlanner.validatePlan(). Example 
SQL:
select *
from functional.alltypestiny a full outer join functional.alltypessmall b
  on a.id != b.id or a.int_col != b.int_col
where a.id < 10;

A user can get around it by setting num_nodes=1. It's a little bit weird, but 
that is how Impala works today. Maybe an error can be useful, because sometimes 
users forget to add a join condition and don't realize they are doing something 
that will be single-threaded.

invertJoins() can convert nested loop join with RIGHT OUTER or RIGHT SEMI into 
LEFT OUTER / LEFT SEMI joins, which are compatible with being distributed. So, 
I don't think we want to fail before that runs. (invertJoins() will eventually 
be a Calcite rule, but for now it can handle this case.)

I think we could drop the num_nodes=1 code and just let it fail at 
SingleNodePlanner.validatePlan(). I don't think that would impact any of the 
TPC queries.


http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/functions/AnalyzedBinaryCompExpr.java
File 
java/calcite-planner/src/main/java/org/apache/impala/calcite/functions/AnalyzedBinaryCompExpr.java:

http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/functions/AnalyzedBinaryCompExpr.java@65
PS16, Line 65:     return 1;
> Yeah, IIRC (and I may be misremembering) I think it crashed in PlanNode.ord
Ok, makes sense


http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/rel/node/ImpalaJoinRel.java
File 
java/calcite-planner/src/main/java/org/apache/impala/calcite/rel/node/ImpalaJoinRel.java:

http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/rel/node/ImpalaJoinRel.java@258
PS16, Line 258:       // Mapping it to a Left Semi Join in Impala seems to make
              :       // sense since it is unclear when we would need a
              :       // Right Semi Join
              :       return JoinOperator.LEFT_SEMI_JOIN;
> Yeah, we're probably going to have to make some Calcite changes to handle t
Yeah, this is not a blocker for this change. Let's update this comment to say 
that Calcite SEMI is always a LEFT SEMI.


http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/service/ExecRequestCreator.java
File 
java/calcite-planner/src/main/java/org/apache/impala/calcite/service/ExecRequestCreator.java:

http://gerrit.cloudera.org:8080/#/c/21239/16/java/calcite-planner/src/main/java/org/apache/impala/calcite/service/ExecRequestCreator.java@120
PS16, Line 120:     if (useSingleNode) {
              :       plannerContext.getQueryOptions().setNum_nodes(1);
              :     }
> My hope is that by the end of this project, there is no need to call "inver
There are certain joins that we don't support for distributed plans. In 
particular, nested loop join does not support distributed RIGHT OUTER or RIGHT 
SEMI. So, the current planner prefers to flip those to LEFT OUTER and LEFT SEMI 
so it can use distributed execution. It doesn't do that based on cost right 
now, but there is an implicit calculation that a distributed plan is better 
than a single-node plan. The join inversion can save this from hitting the 
condition in SingleNodePlanner::validatePlan().



--
To view, visit http://gerrit.cloudera.org:8080/21239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5db097577907d79877f52feff2922000af074ecd
Gerrit-Change-Number: 21239
Gerrit-PatchSet: 16
Gerrit-Owner: Steve Carlin <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Steve Carlin <[email protected]>
Gerrit-Comment-Date: Tue, 03 Sep 2024 19:00:55 +0000
Gerrit-HasComments: Yes

Reply via email to