[ https://issues.apache.org/jira/browse/HIVE-21340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780025#comment-16780025 ]
Vineet Garg commented on HIVE-21340:
------------------------------------

The problem is with HiveSemiJoinRule. Column pruning is occurring; e.g. the plan just before HiveSemiJoinRule is:
{code:sql}
HiveAggregate(group=[{}], agg#0=[count()])
  HiveJoin(condition=[=($0, $1)], joinType=[inner], algorithm=[none], cost=[not available])
    HiveProject(i_item_sk=[$0])
      HiveFilter(condition=[IS NOT NULL($0)])
        HiveTableScan(table=[[perf, item]], table:alias=[item])
    HiveAggregate(group=[{0}])
      HiveFilter(condition=[>($2, 1)])
        HiveAggregate(group=[{2, 9}], agg#0=[count()])
          HiveFilter(condition=[IS NOT NULL($2)])
            HiveTableScan(table=[[perf, store_sales]], table:alias=[store_sales])
{code}
HiveSemiJoinRule rewrites the HiveJoin + HiveAggregate into a HiveSemiJoin. It does not introduce a HiveProject as a replacement for the HiveAggregate; as a result, the schema of the semi-join's right input changes to whatever the HiveAggregate's input is (the HiveFilter in this case).

> CBO: Prune non-key columns feeding into a SemiJoin
> --------------------------------------------------
>
>                 Key: HIVE-21340
>                 URL: https://issues.apache.org/jira/browse/HIVE-21340
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>    Affects Versions: 4.0.0
>            Reporter: Gopal V
>            Assignee: Vineet Garg
>            Priority: Major
>
> {code}
> explain cbo
> with ss as
> (select count(1), ss_item_sk, ss_ticket_number from
> store_sales group by ss_item_sk, ss_ticket_number
> having count(1) > 1)
> select count(1) from item where i_item_sk IN (select ss_item_sk from ss);
> {code}
> Notice the {{HiveProject(ss_item_sk=[$0], ss_ticket_number=[$1], $f2=[$2])}}
> Only ss_item_sk is relevant for the HiveSemiJoin
> {code}
> CBO PLAN:
> HiveAggregate(group=[{}], agg#0=[count()])
>   HiveSemiJoin(condition=[=($0, $1)], joinType=[inner])
>     HiveProject(i_item_sk=[$0])
>       HiveFilter(condition=[IS NOT NULL($0)])
>         HiveTableScan(table=[[tpcds_copy_orc_partitioned_10000, item]], table:alias=[item])
>     HiveProject(ss_item_sk=[$0], ss_ticket_number=[$1], $f2=[$2])
>       HiveFilter(condition=[>($2, 1)])
>         HiveAggregate(group=[{1, 8}], agg#0=[count()])
>           HiveFilter(condition=[IS NOT NULL($1)])
>             HiveTableScan(table=[[tpcds_copy_orc_partitioned_10000, store_sales]], table:alias=[store_sales])
> {code}
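
The schema change described in the comment can be illustrated with a toy model. This is only a sketch, not Hive or Calcite code: the {{Rel}} class and both helper functions are hypothetical names invented for illustration, and the column names are taken from the plans in this issue. It shows that substituting the aggregate's input for the aggregate exposes all three columns, whereas projecting onto the group keys keeps only ss_item_sk. Whether the actual fix takes this form is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rel:
    """Hypothetical stand-in for a plan node: a name plus its output columns."""
    name: str
    fields: List[str]


def right_input_without_project(aggregate_input: Rel) -> Rel:
    # Models the behaviour described in the comment: the aggregate is dropped
    # and its input is used directly, so the input's full schema leaks through.
    return aggregate_input


def right_input_with_project(aggregate_input: Rel, group_keys: List[int]) -> Rel:
    # Models one possible remedy (an assumption): replace the aggregate with a
    # projection onto its group keys, so the schema matches the aggregate's.
    return Rel("HiveProject", [aggregate_input.fields[i] for i in group_keys])


# Input of HiveAggregate(group=[{0}]) in the plan above: the HAVING filter,
# which outputs both group-by columns plus the count.
having_filter = Rel("HiveFilter", ["ss_item_sk", "ss_ticket_number", "$f2"])
group_keys = [0]

unpruned = right_input_without_project(having_filter)
pruned = right_input_with_project(having_filter, group_keys)

print(unpruned.fields)  # all three columns reach the semi-join
print(pruned.fields)    # only the join key remains
```

In the toy model the unpruned right input carries ss_ticket_number and $f2 in addition to the join key, which matches the extra HiveProject columns reported in the issue description.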