[jira] [Updated] (CALCITE-7203) IntersectToSemiJoinRule should compute once the join keys and reuse them to avoid duplicates

Alessandro Solimando (Jira) Sat, 27 Sep 2025 02:46:32 -0700


     [ 
https://issues.apache.org/jira/browse/CALCITE-7203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alessandro Solimando updated CALCITE-7203:
------------------------------------------
    Description: 
[IntersectToSemiJoinRule|https://github.com/apache/calcite/blob/9014934d8c24a5242a6840efe20134e820426c24/core/src/main/java/org/apache/calcite/rel/rules/IntersectToSemiJoinRule.java#L119-L128]
 repeatedly creates cast expressions for each pair of intersect operands, 
whereas we could pre-compute these join keys once against the row type of the 
n-way intersect expression, which is the final type that all intersect operands 
must conform to.

Computing the join keys pair-wise, as the current implementation does, 
introduces duplicates and noise, because each pair is unified only partially 
rather than against the stable final/global row type.
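
To make the difference concrete, here is a minimal, self-contained sketch. It is a toy model, not Calcite code: `IntersectKeySketch`, `DecimalType`, `unify`, `pairwiseKeys` and `globalKeys` are hypothetical names, INTEGER is assumed to behave like DECIMAL(10, 0), and `unify` is a simplified stand-in for least-restrictive type derivation. It does not attempt to reproduce the plans in this issue; it only contrasts how many cast keys each strategy creates and against which target type.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of join-key computation for an n-way INTERSECT (not Calcite code). */
final class IntersectKeySketch {

  /** Hypothetical stand-in for the type of a single DECIMAL column. */
  static final class DecimalType {
    final int precision;
    final int scale;
    final boolean nullable;

    DecimalType(int precision, int scale, boolean nullable) {
      this.precision = precision;
      this.scale = scale;
      this.nullable = nullable;
    }

    /** Simplified least-restrictive type: keep the most integer digits and the largest scale. */
    DecimalType unify(DecimalType other) {
      int s = Math.max(scale, other.scale);
      int intDigits = Math.max(precision - scale, other.precision - other.scale);
      return new DecimalType(intDigits + s, s, nullable || other.nullable);
    }

    @Override public String toString() {
      return "DECIMAL(" + precision + ", " + scale + ")" + (nullable ? "" : " NOT NULL");
    }
  }

  /** Pair-wise strategy: each semi-join unifies only its own two inputs and casts both sides. */
  static List<String> pairwiseKeys(List<DecimalType> operands) {
    List<String> casts = new ArrayList<>();
    DecimalType left = operands.get(0);
    for (int i = 1; i < operands.size(); i++) {
      DecimalType target = left.unify(operands.get(i));
      casts.add("CAST(left" + i + "):" + target);
      casts.add("CAST(right" + i + "):" + target);
      left = target; // the next join sees the partially unified type
    }
    return casts;
  }

  /** Global strategy: unify all operands once, then cast each operand exactly once. */
  static List<String> globalKeys(List<DecimalType> operands) {
    DecimalType target = operands.get(0);
    for (DecimalType t : operands) {
      target = target.unify(t);
    }
    List<String> casts = new ArrayList<>();
    for (int i = 0; i < operands.size(); i++) {
      casts.add("CAST(input" + i + "):" + target);
    }
    return casts;
  }

  public static void main(String[] args) {
    List<DecimalType> operands = List.of(
        new DecimalType(2, 1, false),   // e.g. values 1.0 .. 5.0
        new DecimalType(10, 0, false),  // INTEGER, modeled as DECIMAL(10, 0) (assumption)
        new DecimalType(2, 1, true));   // e.g. values 1.0, 4.0, null
    System.out.println("pair-wise:");
    pairwiseKeys(operands).forEach(System.out::println);
    System.out.println("global:");
    globalKeys(operands).forEach(System.out::println);
  }
}
```

In this model, the three operands yield four pair-wise casts against two different targets (a NOT NULL one for the first join, a nullable one for the second), while the global strategy yields three casts that all share a single target type.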

[planner.iq#L150-L179|https://github.com/apache/calcite/blob/9014934d8c24a5242a6840efe20134e820426c24/core/src/test/resources/sql/planner.iq#L150-L179]
 could be simplified:

before:
{noformat}
EnumerableCalc(expr#0..1=[{inputs}], expr#2=[CAST($t0):DECIMAL(11, 1)], A=[$t2])
  EnumerableHashJoin(condition=[=($1, $3)], joinType=[semi])
    EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1)], proj#0..1=[{exprs}])
      EnumerableAggregate(group=[{0}])
        EnumerableHashJoin(condition=[=($1, $3)], joinType=[semi])
          EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1) NOT NULL], A=[$t1], A0=[$t1])
            EnumerableValues(tuples=[[{ 1.0 }, { 2.0 }, { 3.0 }, { 4.0 }, { 5.0 }]])
          EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1) NOT NULL], A=[$t1], A0=[$t1])
            EnumerableValues(tuples=[[{ 1 }, { 2 }]])
    EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1)], A=[$t1], A0=[$t1]) <= extra A0
      EnumerableValues(tuples=[[{ 1.0 }, { 4.0 }, { null }]])
{noformat}
after:
{noformat}
EnumerableAggregate(group=[{0}])
  EnumerableNestedLoopJoin(condition=[IS NOT DISTINCT FROM($0, $1)], joinType=[semi])
    EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1)], A=[$t1])
      EnumerableAggregate(group=[{0}])
        EnumerableNestedLoopJoin(condition=[IS NOT DISTINCT FROM($0, $1)], joinType=[semi])
          EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1) NOT NULL], A=[$t1])
            EnumerableValues(tuples=[[{ 1.0 }, { 2.0 }, { 3.0 }, { 4.0 }, { 5.0 }]])
          EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1) NOT NULL], A=[$t1]) <= no more A0
            EnumerableValues(tuples=[[{ 1 }, { 2 }]])
    EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1)], A=[$t1])
      EnumerableValues(tuples=[[{ 1.0 }, { 4.0 }, { null }]])
{noformat}
[This PR discussion|https://github.com/apache/calcite/pull/4557#discussion_r2384022473]
 explains in more detail why this is needed.

  was:
[IntersectToSemiJoinRule|https://github.com/apache/calcite/blob/9014934d8c24a5242a6840efe20134e820426c24/core/src/main/java/org/apache/calcite/rel/rules/IntersectToSemiJoinRule.java#L119-L128]
 repeatedly creates cast expressions for each pair of intersect operands, 
whereas we could pre-compute these join keys once against the row type of the 
n-way intersect expression, which is the final type that all intersect operands 
must conform to.

Computing the join keys pair-wise, as the current implementation does, 
introduces duplicates and noise, because each pair is unified only partially 
rather than against the stable final/global row type.

[planner.iq#L150-L179|https://github.com/apache/calcite/blob/9014934d8c24a5242a6840efe20134e820426c24/core/src/test/resources/sql/planner.iq#L150-L179]
 could be simplified:

before:
{noformat}
EnumerableCalc(expr#0..1=[{inputs}], expr#2=[CAST($t0):DECIMAL(11, 1)], A=[$t2])
  EnumerableHashJoin(condition=[=($1, $3)], joinType=[semi])
    EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1)], proj#0..1=[{exprs}])
      EnumerableAggregate(group=[{0}])
        EnumerableHashJoin(condition=[=($1, $3)], joinType=[semi])
          EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1) NOT NULL], A=[$t1], A0=[$t1])
            EnumerableValues(tuples=[[{ 1.0 }, { 2.0 }, { 3.0 }, { 4.0 }, { 5.0 }]])
          EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1) NOT NULL], A=[$t1], A0=[$t1])
            EnumerableValues(tuples=[[{ 1 }, { 2 }]])
    EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1)], A=[$t1], A0=[$t1]) <= extra A0
      EnumerableValues(tuples=[[{ 1.0 }, { 4.0 }, { null }]])
{noformat}
after:
{noformat}
EnumerableAggregate(group=[{0}])
  EnumerableNestedLoopJoin(condition=[IS NOT DISTINCT FROM($0, $1)], joinType=[semi])
    EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1)], A=[$t1])
      EnumerableAggregate(group=[{0}])
        EnumerableNestedLoopJoin(condition=[IS NOT DISTINCT FROM($0, $1)], joinType=[semi])
          EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1) NOT NULL], A=[$t1])
            EnumerableValues(tuples=[[{ 1.0 }, { 2.0 }, { 3.0 }, { 4.0 }, { 5.0 }]])
          EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1) NOT NULL], A=[$t1]) <= no more A0
            EnumerableValues(tuples=[[{ 1 }, { 2 }]])
    EnumerableCalc(expr#0=[{inputs}], expr#1=[CAST($t0):DECIMAL(11, 1)], A=[$t1])
      EnumerableValues(tuples=[[{ 1.0 }, { 4.0 }, { null }]])
{noformat}


> IntersectToSemiJoinRule should compute once the join keys and reuse them to 
> avoid duplicates
> --------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-7203
>                 URL: https://issues.apache.org/jira/browse/CALCITE-7203
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.40.0
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
