[jira] [Created] (HIVE-13254) GBY cardinality estimation is wrong partition columns is involved

Prasanth Jayachandran (JIRA) Wed, 09 Mar 2016 17:36:12 -0800

Prasanth Jayachandran created HIVE-13254:
--------------------------------------------


             Summary: GBY cardinality estimation is wrong partition columns is 
involved
                 Key: HIVE-13254
                 URL: https://issues.apache.org/jira/browse/HIVE-13254
             Project: Hive
          Issue Type: Bug
          Components: Hive
    Affects Versions: 2.0.0, 1.3.0, 2.1.0
            Reporter: Prasanth Jayachandran
         Attachments: q3_ef_transpose_aggr.svg

When running the following query on TPCDS-1000 scale, setting 
hive.transpose.aggr.join=true is expected to generate optimal plan but it was 
not generating. 

{code:title=Query}
SELECT `date_dim`.`d_day_name` AS `d_day_name`, 
       `item`.`i_category`     AS `i_category` 
FROM   `store_sales` `store_sales` 
       INNER JOIN `item` `item` 
               ON ( `store_sales`.`ss_item_sk` = `item`.`i_item_sk` ) 
       INNER JOIN `date_dim` `date_dim` 
               ON ( `store_sales`.`ss_sold_date_sk` = `date_dim`.`d_date_sk` ) 
GROUP  BY `d_day_name`, 
          `i_category`;
{code}

The reason for that is stats annotation rule for GROUP BY is not considering 
partition column into account. For the above query, the generated plan is 
attached. As we can see from the plan, GBY is pushed to fact table 
(store_sales) but that output of GBY shuffled to perform join instead of 
MapJoin conversion. This is because of wrong estimation of cardinality/data 
size of GBY on store_sales (Map 1). 

What's happening internally is, GBY computes estimated cardinality which in 
this case is NDV(ss_item_sk) * NDV(ss_sold_date_sk) = 338901 * 1823 ~= 617M. 
This estimate is wrong as ss_sold_date_sk is partition column and estimator 
assumes its non-partition column. In this case, not every tasks reads data from 
all partitions. We need to take estimated task parallelism into account. For 
example: If task parallelism is determined to be 100 the estimate from GBY 
should be ~6M which should convert this vertex into map join vertex. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (HIVE-13254) GBY cardinality estimation is wrong partition columns is involved

Reply via email to