[ https://issues.apache.org/jira/browse/CALCITE-6893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17936278#comment-17936278 ]
Julian Hyde commented on CALCITE-6893: -------------------------------------- That query is valid. I would support having a variant of the rule that generates count-distinct. However, many implementations of count-distinct will introduce an intermediate "Aggregate(job, i)" step, and these will be equivalent to the current plan. (Some implementations, such as BigQuery, will use a small array; for these implementations, count-distinct will be a win.) If the number of branches (N) is small, we can use other aggregate functions. If N = 2 we can use min and max: {code} select job from ( select job, 0 as i from "scott".emp where deptno = 10 union all select job, 1 as i from "scott".emp where deptno = 20) group by job having min(i) = 0 and max(i) = 1; {code} For small N we can use separate {{count}} aggregate functions: {code} select job from ( select job, 0 as i from "scott".emp where deptno = 10 union all select job, 1 as i from "scott".emp where deptno = 20) group by job having count(*) filter (where i = 0) > 0 and count(*) filter (where i = 1) > 0; {code} or equivalently {code} select job from ( select job, 1 as i0, null as i1 from "scott".emp where deptno = 10 union all select job, null as i0, 1 as i1 from "scott".emp where deptno = 20) group by job having count(i0) > 0 and count(i1) > 0; {code} > Remove agg from Union children in IntersectToDistinctRule > --------------------------------------------------------- > > Key: CALCITE-6893 > URL: https://issues.apache.org/jira/browse/CALCITE-6893 > Project: Calcite > Issue Type: Improvement > Reporter: Zhen Chen > Assignee: Zhen Chen > Priority: Major > Labels: pull-request-available > Attachments: image-2025-03-16-11-02-04-725.png > > > Whether agg should be pushed down should be determined by the cost. > SQL: > > {code:java} > select ename from emp where deptno = 10 > intersect > select ename from emp where deptno = 20{code} > > Then used rule INTERSECT_TO_DISTINCT(updated version) and > AGGREGATE_UNION_TRANSPOSE in hep planner. > We can get logical plan: > > {code:java} > LogicalProject(ENAME=[$0]) > LogicalFilter(condition=[=($1, 2)]) > LogicalAggregate(group=[{0}], agg#0=[$SUM0($1)]) > LogicalUnion(all=[true]) > LogicalAggregate(group=[{0}], agg#0=[COUNT()]) > LogicalProject(ENAME=[$1]) > LogicalFilter(condition=[=($7, 10)]) > LogicalTableScan(table=[[CATALOG, SALES, EMP]]) > LogicalAggregate(group=[{0}], agg#0=[COUNT()]) > LogicalProject(ENAME=[$1]) > LogicalFilter(condition=[=($7, 20)]) > LogicalTableScan(table=[[CATALOG, SALES, EMP]]) {code} > Then we also use the two same rules in volcanol planner. > > Final Phy Plan: > > {code:java} > EnumerableProject(ENAME=[$0]): rowcount = 1.0, cumulative cost = > {43.72500000000001 rows, 68.4 cpu, 0.0 io}, id = 85 > EnumerableFilter(condition=[=($1, 2)]): rowcount = 1.0, cumulative cost = > {42.72500000000001 rows, 67.4 cpu, 0.0 io}, id = 84 > EnumerableAggregate(group=[{0}], agg#0=[COUNT()]): rowcount = 1.0, > cumulative cost = {41.72500000000001 rows, 66.4 cpu, 0.0 io}, id = 83 > EnumerableUnion(all=[true]): rowcount = 4.2, cumulative cost = > {40.60000000000001 rows, 66.4 cpu, 0.0 io}, id = 82 > EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = > {18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 79 > EnumerableFilter(condition=[=($7, 10)]): rowcount = 2.1, cumulative > cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 78 > EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = > 14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69 > EnumerableProject(ENAME=[$1]): rowcount = 2.1, cumulative cost = > {18.200000000000003 rows, 31.1 cpu, 0.0 io}, id = 81 > EnumerableFilter(condition=[=($7, 20)]): rowcount = 2.1, cumulative > cost = {16.1 rows, 29.0 cpu, 0.0 io}, id = 80 > EnumerableTableScan(table=[[CATALOG, SALES, EMP]]): rowcount = > 14.0, cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 69 > {code} > > We can see the best plan, the children of union do not have agg. > DAG: > !image-2025-03-16-11-02-04-725.png|width=545,height=348! > Currently, Calcite does not support distributed planning. If in a distributed > planning, agg will be divided into two stages. If the filtering effect in the > first stage is very good, the downward push of agg will be meaningful and > reduce the network transmission of shuffle. However, optimizing the current > rule is also meaningful. Calcite now also has rules that can do the downward > push of agg. We can give the choice to the volcano. -- This message was sent by Atlassian Jira (v8.20.10#820010)