[ 
https://issues.apache.org/jira/browse/HIVE-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468847#comment-15468847
 ] 

Gopal V commented on HIVE-14708:
--------------------------------

bq. I am not sure how is it possible to do this in a single scan

Hive actually does that in other scenarios, when inserting into tables. Try the 
following query in Tez

{code}
create temporary table null_date(nul bigint);
create temporary table min_date(mm bigint);

explain 
from date_dim
insert overwrite table min_date select min(d_date_sk)
insert overwrite table null_date select count(1) where d_date_sk is null ;
{code}

> Optimizer: NOT IN query scans one input two times
> -------------------------------------------------
>
>                 Key: HIVE-14708
>                 URL: https://issues.apache.org/jira/browse/HIVE-14708
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>    Affects Versions: 2.2.0
>            Reporter: Gopal V
>            Priority: Critical
>
> {code}
> hive (tpcds_bin_partitioned_orc_1000)> explain select count(1) from 
> store_sales where ss_sold_date_sk NOT in (select d_date_sk from date_dim);
>     Stage-1
>       Reducer 2 vectorized, llap
>       File Output Operator [FS_52]
>         Group By Operator [GBY_51] (rows=1 width=8)
>           Output:["_col0"],aggregations:["count(VALUE._col0)"]
>         <-Map 1 [SIMPLE_EDGE] vectorized, llap
>           SHUFFLE [RS_50]
>             Group By Operator [GBY_49] (rows=1 width=8)
>               Output:["_col0"],aggregations:["count(1)"]
>               Select Operator [SEL_48] (rows=1 width=4)
>                 Filter Operator [FIL_47] (rows=1 width=4)
>                   predicate:_col2 is null
>                   Map Join Operator [MAPJOIN_46] (rows=2879987999 width=4)
>                     Conds:MAPJOIN_45._col0=RS_43._col0(Left 
> Outer),Output:["_col2"]
>                   <-Map 5 [BROADCAST_EDGE] vectorized, llap
>                     BROADCAST [RS_43]
>                       PartitionCols:_col0
>                       Select Operator [SEL_42] (rows=73049 width=4)
>                         Output:["_col0"]
>                         TableScan [TS_11] (rows=73049 width=4)
>                           
> tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
>                   <-Map Join Operator [MAPJOIN_45] (rows=2879987999 width=4)
>                       Conds:(Inner),Output:["_col0"]
>                     <-Reducer 4 [BROADCAST_EDGE] vectorized, llap
>                       BROADCAST [RS_41]
>                         Select Operator [SEL_40] (rows=1 width=8)
>                           Filter Operator [FIL_39] (rows=1 width=8)
>                             predicate:(_col0 = 0)
>                             Group By Operator [GBY_38] (rows=1 width=8)
>                               
> Output:["_col0"],aggregations:["count(VALUE._col0)"]
>                             <-Map 3 [SIMPLE_EDGE] vectorized, llap
>                               SHUFFLE [RS_37]
>                                 Group By Operator [GBY_36] (rows=1 width=8)
>                                   Output:["_col0"],aggregations:["count()"]
>                                   Select Operator [SEL_35] (rows=1 width=4)
>                                     Filter Operator [FIL_34] (rows=1 width=4)
>                                       predicate:d_date_sk is null
>                                       TableScan [TS_2] (rows=73049 width=4)
>                                         
> tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
>                     <-Select Operator [SEL_44] (rows=2879987999 width=4)
>                         Output:["_col0"]
>                         TableScan [TS_0] (rows=2879987999 width=92)
>                           
> tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE
> {code}
> The 2nd scan is merely to count the number of NULLs and has 
> {{predicate:d_date_sk is null}} in the operator.
> The NULL checks can be done inline with the NOT-NULL codepath instead of 
> producing 2 independent full-scans of the date_dim table.
> This is not significant in a scenario like the above where the small table 
> side is an actual HDFS table, but entirely throttles performance when the 
> small side is actually an expensive aggregate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to