[jira] [Updated] (CALCITE-6893) Remove agg from Union children in IntersectToDistinctRule

Zhen Chen (Jira) Sat, 15 Mar 2025 20:16:13 -0700


     [ 
https://issues.apache.org/jira/browse/CALCITE-6893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zhen Chen updated CALCITE-6893:
-------------------------------
    Description: 
Whether agg should be pushed down should be determined by the cost.

SQL:

 
{code:java}
select ename from emp where deptno = 10
intersect
select ename from emp where deptno = 20{code}
 

Then used rule INTERSECT_TO_DISTINCT(updated version)  and 
AGGREGATE_UNION_TRANSPOSE in hep planner.

We can get logical plan:

 
{code:java}
LogicalProject(ENAME=[$0])
  LogicalFilter(condition=[=($1, 2)])
    LogicalAggregate(group=[{0}], agg#0=[$SUM0($1)])
      LogicalUnion(all=[true])
        LogicalAggregate(group=[{0}], agg#0=[COUNT()])
          LogicalProject(ENAME=[$1])
            LogicalFilter(condition=[=($7, 10)])
              LogicalTableScan(table=[[CATALOG, SALES, EMP]])
        LogicalAggregate(group=[{0}], agg#0=[COUNT()])
          LogicalProject(ENAME=[$1])
            LogicalFilter(condition=[=($7, 20)])
              LogicalTableScan(table=[[CATALOG, SALES, EMP]]) {code}
Then we also use the two same rules in volcanol planner.

 

Final Phy Plan:

 
{code:java}
EnumerableProject(ENAME=[$0]): rowcount = 1.0, cumulative cost = 
{43.72500000000001 rows, 68.4 cpu, 0.0 io}, id = 85
  EnumerableFilter(condition=[=($1, 2)]): rowcount = 1.0, cumulative cost = 
{42.72500000000001 rows, 67.4 cpu, 0.0 io}, id = 84
    EnumerableAggregate(group=[{0}], agg#0=[COUNT()]): rowcount = 1.0, 
cumulative cost = {41.72500000000001 rows, 66.4 cpu, 0.0 io}, id = 83
      EnumerableUnion(all=[true]): rowcount = 4.2, cumulative cost = 
{40.60000000000001 rows, 66.4 cpu, 0.0 io}, id = 82
        EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = 
{18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 79
          EnumerableFilter(condition=[=($7, 10)]): rowcount = 2.1, cumulative 
cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 78
            EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = 
14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69
        EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = 
{18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 81
          EnumerableFilter(condition=[=($7, 20)]): rowcount = 2.1, cumulative 
cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 80
            EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = 
14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69
 {code}
 

We can see the best plan, the children of union do not have agg. 

DAG:

!image-2025-03-16-11-02-04-725.png|width=545,height=348!

Currently, Calcite does not support distributed planning. If in a distributed 
planning, agg will be divided into two stages. If the filtering effect in the 
first stage is very good, the downward push of agg will be meaningful and 
reduce the network transmission of shuffle. However, optimizing the current 
rule is also meaningful. Calcite now also has rules that can do the downward 
push of agg. We can give the choice to the volcano.

  was:
Whether agg should be pushed down should be determined by the cost.

SQL:

 

Final Phy Plan:

 

DAG:

 


> Remove agg from Union children in IntersectToDistinctRule
> ---------------------------------------------------------
>
>                 Key: CALCITE-6893
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6893
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: Zhen Chen
>            Assignee: Zhen Chen
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2025-03-16-11-02-04-725.png
>
>
> Whether agg should be pushed down should be determined by the cost.
> SQL:
>  
> {code:java}
> select ename from emp where deptno = 10
> intersect
> select ename from emp where deptno = 20{code}
>  
> Then used rule INTERSECT_TO_DISTINCT(updated version)  and 
> AGGREGATE_UNION_TRANSPOSE in hep planner.
> We can get logical plan:
>  
> {code:java}
> LogicalProject(ENAME=[$0])
>   LogicalFilter(condition=[=($1, 2)])
>     LogicalAggregate(group=[{0}], agg#0=[$SUM0($1)])
>       LogicalUnion(all=[true])
>         LogicalAggregate(group=[{0}], agg#0=[COUNT()])
>           LogicalProject(ENAME=[$1])
>             LogicalFilter(condition=[=($7, 10)])
>               LogicalTableScan(table=[[CATALOG, SALES, EMP]])
>         LogicalAggregate(group=[{0}], agg#0=[COUNT()])
>           LogicalProject(ENAME=[$1])
>             LogicalFilter(condition=[=($7, 20)])
>               LogicalTableScan(table=[[CATALOG, SALES, EMP]]) {code}
> Then we also use the two same rules in volcanol planner.
>  
> Final Phy Plan:
>  
> {code:java}
> EnumerableProject(ENAME=[$0]): rowcount = 1.0, cumulative cost = 
> {43.72500000000001 rows, 68.4 cpu, 0.0 io}, id = 85
>   EnumerableFilter(condition=[=($1, 2)]): rowcount = 1.0, cumulative cost = 
> {42.72500000000001 rows, 67.4 cpu, 0.0 io}, id = 84
>     EnumerableAggregate(group=[{0}], agg#0=[COUNT()]): rowcount = 1.0, 
> cumulative cost = {41.72500000000001 rows, 66.4 cpu, 0.0 io}, id = 83
>       EnumerableUnion(all=[true]): rowcount = 4.2, cumulative cost = 
> {40.60000000000001 rows, 66.4 cpu, 0.0 io}, id = 82
>         EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = 
> {18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 79
>           EnumerableFilter(condition=[=($7, 10)]): rowcount = 2.1, cumulative 
> cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 78
>             EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = 
> 14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69
>         EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = 
> {18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 81
>           EnumerableFilter(condition=[=($7, 20)]): rowcount = 2.1, cumulative 
> cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 80
>             EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = 
> 14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69
>  {code}
>  
> We can see the best plan, the children of union do not have agg. 
> DAG:
> !image-2025-03-16-11-02-04-725.png|width=545,height=348!
> Currently, Calcite does not support distributed planning. If in a distributed 
> planning, agg will be divided into two stages. If the filtering effect in the 
> first stage is very good, the downward push of agg will be meaningful and 
> reduce the network transmission of shuffle. However, optimizing the current 
> rule is also meaningful. Calcite now also has rules that can do the downward 
> push of agg. We can give the choice to the volcano.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (CALCITE-6893) Remove agg from Union children in IntersectToDistinctRule

Reply via email to