[ https://issues.apache.org/jira/browse/HIVE-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15469063#comment-15469063 ]
Ashutosh Chauhan commented on HIVE-14708: ----------------------------------------- This is different than what you are asking, because result set is independent for two insert statement whereas for NOT IN one result set is used in the next one. > Optimizer: NOT IN query scans one input two times > ------------------------------------------------- > > Key: HIVE-14708 > URL: https://issues.apache.org/jira/browse/HIVE-14708 > Project: Hive > Issue Type: Bug > Components: Logical Optimizer > Affects Versions: 2.2.0 > Reporter: Gopal V > Priority: Critical > > {code} > hive (tpcds_bin_partitioned_orc_1000)> explain select count(1) from > store_sales where ss_sold_date_sk NOT in (select d_date_sk from date_dim); > Stage-1 > Reducer 2 vectorized, llap > File Output Operator [FS_52] > Group By Operator [GBY_51] (rows=1 width=8) > Output:["_col0"],aggregations:["count(VALUE._col0)"] > <-Map 1 [SIMPLE_EDGE] vectorized, llap > SHUFFLE [RS_50] > Group By Operator [GBY_49] (rows=1 width=8) > Output:["_col0"],aggregations:["count(1)"] > Select Operator [SEL_48] (rows=1 width=4) > Filter Operator [FIL_47] (rows=1 width=4) > predicate:_col2 is null > Map Join Operator [MAPJOIN_46] (rows=2879987999 width=4) > Conds:MAPJOIN_45._col0=RS_43._col0(Left > Outer),Output:["_col2"] > <-Map 5 [BROADCAST_EDGE] vectorized, llap > BROADCAST [RS_43] > PartitionCols:_col0 > Select Operator [SEL_42] (rows=73049 width=4) > Output:["_col0"] > TableScan [TS_11] (rows=73049 width=4) > > tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"] > <-Map Join Operator [MAPJOIN_45] (rows=2879987999 width=4) > Conds:(Inner),Output:["_col0"] > <-Reducer 4 [BROADCAST_EDGE] vectorized, llap > BROADCAST [RS_41] > Select Operator [SEL_40] (rows=1 width=8) > Filter Operator [FIL_39] (rows=1 width=8) > predicate:(_col0 = 0) > Group By Operator [GBY_38] (rows=1 width=8) > > Output:["_col0"],aggregations:["count(VALUE._col0)"] > <-Map 3 [SIMPLE_EDGE] vectorized, llap > SHUFFLE [RS_37] > Group By Operator [GBY_36] (rows=1 width=8) > Output:["_col0"],aggregations:["count()"] > Select Operator [SEL_35] (rows=1 width=4) > Filter Operator [FIL_34] (rows=1 width=4) > predicate:d_date_sk is null > TableScan [TS_2] (rows=73049 width=4) > > tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"] > <-Select Operator [SEL_44] (rows=2879987999 width=4) > Output:["_col0"] > TableScan [TS_0] (rows=2879987999 width=92) > > tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE > {code} > The 2nd scan is merely to count the number of NULLs and has > {{predicate:d_date_sk is null}} in the operator. > The NULL checks can be done inline with the NOT-NULL codepath instead of > producing 2 independent full-scans of the date_dim table. > This is not significant in a scenario like the above where the small table > side is an actual HDFS table, but entirely throttles performance when the > small side is actually an expensive aggregate. -- This message was sent by Atlassian JIRA (v6.3.4#6332)