[jira] [Commented] (HIVE-15758) Allow correlated scalar subqueries with aggregates which has non-equi join predicates

Vineet Garg (JIRA) Fri, 14 Jul 2017 14:11:51 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088071#comment-16088071
 ]


Vineet Garg commented on HIVE-15758:
------------------------------------

[~pxiong] For the query mentioned above in this JIRA hive generates following 
plan
{noformat}
Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 2
      File Output Operator [FS_24]
        Select Operator [SEL_23] (rows=185 width=244)
          
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"]
          Filter Operator [FIL_22] (rows=185 width=244)
            predicate:(UDFToLong(_col5) <> CASE WHEN (_col10 is null) THEN (0) 
ELSE (_col9) END)
            Merge Join Operator [MERGEJOIN_29] (rows=185 width=244)
              Conds:RS_19._col4=RS_20._col2(Left 
Outer),Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8","_col9","_col10"]
            <-Map 1 [SIMPLE_EDGE]
              SHUFFLE [RS_19]
                PartitionCols:_col4
                Select Operator [SEL_1] (rows=26 width=121)
                  
Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"]
                  TableScan [TS_0] (rows=26 width=121)
                    
default@part,part,Tbl:COMPLETE,Col:NONE,Output:["p_partkey","p_name","p_mfgr","p_brand","p_type","p_size","p_container","p_retailprice","p_comment"]
            <-Reducer 5 [SIMPLE_EDGE]
              SHUFFLE [RS_20]
                PartitionCols:_col2
                Select Operator [SEL_18] (rows=169 width=243)
                  Output:["_col0","_col1","_col2"]
                  Group By Operator [GBY_17] (rows=169 width=243)
                    
Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0
                  <-Reducer 4 [SIMPLE_EDGE]
                    SHUFFLE [RS_16]
                      PartitionCols:_col0
                      Group By Operator [GBY_15] (rows=338 width=243)
                        
Output:["_col0","_col1"],aggregations:["count(_col1)"],keys:_col2
                        Select Operator [SEL_14] (rows=338 width=243)
                          Output:["_col1","_col2"]
                          Filter Operator [FIL_13] (rows=338 width=243)
                            predicate:(_col2 <> _col0)
                            Merge Join Operator [MERGEJOIN_28] (rows=338 
width=243)
                              Conds:(Inner),Output:["_col0","_col1","_col2"]
                            <-Map 3 [CUSTOM_SIMPLE_EDGE]
                              PARTITION_ONLY_SHUFFLE [RS_10]
                                Select Operator [SEL_3] (rows=26 width=121)
                                  Output:["_col0","_col1"]
                                  TableScan [TS_2] (rows=26 width=121)
                                    
default@part,pp,Tbl:COMPLETE,Col:NONE,Output:["p_type","p_size"]
                            <-Reducer 7 [CUSTOM_SIMPLE_EDGE]
                              PARTITION_ONLY_SHUFFLE [RS_11]
                                Group By Operator [GBY_8] (rows=13 width=121)
                                  Output:["_col0"],keys:KEY._col0
                                <-Map 6 [SIMPLE_EDGE]
                                  SHUFFLE [RS_7]
                                    PartitionCols:_col0
                                    Group By Operator [GBY_6] (rows=26 
width=121)
                                      Output:["_col0"],keys:p_type
                                      TableScan [TS_4] (rows=26 width=121)
                                        
default@part,part,Tbl:COMPLETE,Col:NONE,Output:["p_type"]
{noformat}

First outer table p is joined with inner table pp to come up with all possible 
values where correlated condition {{pp.p_type <> p.p_type}} is true. Then group 
by is done on this to do count(). This is then further joined (Left outer) with 
outer table with a filter on top taking care of case when right side is null 
(count is suppose to return 0 instead of null for empty rows) along with 
whatever subquery condition there is.

Briefly we re-write such queries into left outer join with a filter on top. 
Hope this makes sense.

> Allow correlated scalar subqueries with aggregates which has non-equi join 
> predicates
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-15758
>                 URL: https://issues.apache.org/jira/browse/HIVE-15758
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Logical Optimizer
>            Reporter: Vineet Garg
>            Assignee: Vineet Garg
>              Labels: sub-query
>         Attachments: HIVE-15758.1.patch, HIVE-15758.2.patch
>
>
> Queries such as 
> {code} select * from part where p_size <> (select count(p_size) from part pp 
> where part.p_type <> pp.p_type); {code} are currently not allowed since HIVE 
> doesn't know how to rewrite such queries to preserve the correctness for 
> cases when there is zero row



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (HIVE-15758) Allow correlated scalar subqueries with aggregates which has non-equi join predicates

Reply via email to