Rajesh Balamohan created HIVE-26520: ---------------------------------------
Summary: Improve dynamic partition pruning operator when subqueries are involved Key: HIVE-26520 URL: https://issues.apache.org/jira/browse/HIVE-26520 Project: Hive Issue Type: Improvement Components: HiveServer2 Reporter: Rajesh Balamohan Attachments: q58_test.pdf Dynamic partition pruning operator sends entire date_dim table and due to this, entire catalog_sales data is scanned causing huge IO and decoding cost. If dynamic partition pruning operator was created after the "date_dim" subquery has been evaluated, it would have saved huge IO cost. E.g It would have just taken 6-7 partition scans instead of 1800+ partitions. Consider the following simplified query as example {noformat} select count(*) from (select i_item_id item_id ,sum(cs_ext_sales_price) cs_item_rev from catalog_sales ,item ,date_dim where cs_item_sk = i_item_sk and d_date in (select d_date from date_dim where d_week_seq = (select d_week_seq from date_dim where d_date = '1998-02-21')) and cs_sold_date_sk = d_date_sk group by i_item_id) a; CBO PLAN: HiveAggregate(group=[{}], agg#0=[count()]) HiveProject(i_item_id=[$0]) HiveAggregate(group=[{4}]) HiveSemiJoin(condition=[=($6, $7)], joinType=[semi]) HiveJoin(condition=[=($2, $5)], joinType=[inner]) HiveJoin(condition=[=($0, $3)], joinType=[inner]) HiveProject(cs_item_sk=[$14], cs_ext_sales_price=[$22], cs_sold_date_sk=[$33]) HiveFilter(condition=[AND(IS NOT NULL($33), IS NOT NULL($14))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, catalog_sales]], table:alias=[catalog_sales]) HiveProject(i_item_sk=[$0], i_item_id=[$1]) HiveFilter(condition=[IS NOT NULL($0)]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, item]], table:alias=[item]) HiveProject(d_date_sk=[$0], d_date=[$2]) HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($0))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim]) HiveProject(d_date=[$0]) HiveJoin(condition=[=($1, $3)], joinType=[inner]) HiveJoin(condition=[true], joinType=[inner]) HiveProject(d_date=[$2], d_week_seq=[$4]) HiveFilter(condition=[AND(IS NOT NULL($2), IS NOT NULL($4))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim]) HiveProject(cnt=[$0]) HiveFilter(condition=[<=(sq_count_check($0), 1)]) HiveProject(cnt=[$0]) HiveAggregate(group=[{}], cnt=[COUNT()]) HiveFilter(condition=[=($2, 1998-02-21)]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim]) HiveProject(d_week_seq=[$4]) HiveFilter(condition=[AND(=($2, 1998-02-21), IS NOT NULL($4))]) HiveTableScan(table=[[tpcds_bin_partitioned_orc_10000_external, date_dim]], table:alias=[date_dim]) {noformat} I will attach the formatted plan for reference as well. If planner generated the dynamic partition pruning event after "date_dim" got evaluated in "Map 7", it would be been very efficient. -- This message was sent by Atlassian Jira (v8.20.10#820010)