Mostafa Mokhtar created HIVE-7834:
-------------------------------------

             Summary: Use min, max and NDV from the stats to better estimate 
many to many vs one to many inner joins
                 Key: HIVE-7834
                 URL: https://issues.apache.org/jira/browse/HIVE-7834
             Project: Hive
          Issue Type: Bug
          Components: CBO
    Affects Versions: 0.13.1
            Reporter: Mostafa Mokhtar
             Fix For: 0.14.0


I noticed that the estimate number of rows in Map joins is higher after the 
join than before the join that is with column stats fetch ON or OFF.
TPC-DS Q55 was a good example for that, the issue is that the current 
statistics provide us enough information that we can estimate with strong 
confidence that the joins are one to many and not many to many.

Joining store_sales x item on ss_item_sk = i_item_sk, we know that the NDV, min 
and max values for both join columns match while the row counts are different 
this pattern indicates a PK/FK relationship between store_sales and item.
Yet when a filter is applied on item and reduces the number of rows from 462K 
to 7K we estimate a many to many join between the filtered item and store_sales 
and as a result the estimate number of rows coming out of the join is off by 
several orders of magnitude.

Available information from the stats
{code}
Table           Join column     NDV from describe               NDV actual      
min             max
item            i_item_sk       439,501                         462,000         
1               462,000
date_dim        d_date_sk       65,332                          73,049          
2,415,022       2,488,070
store_sales     ss_item_sk      439,501                         462,000         
1               462,000
store_sales     ss_sold_date_sk 2,226                           1,823           
2,450,816       2,452,642
{code}

Same thing applies to store_sales and date_dim but with a caveat that the NDV , 
min and max values don't match where date_dim has a bigger domain and 
accordingly a higher NDV count.
For joining store_sales and item on on ss_item_sk = i_item_sk since both 
columns have the same NDV, min and max values we can safely conclude that 
selectivity on item will translate to similar selectivity on store_sales.
This is not the case for joining store_sales and date_dim on ss_sold_date_sk = 
d_date_sk since the domain of d_date_sk is much bigger than that of 
ss_sold_date_sk, differences in domain need to be taken into account when 
inferring selectivity onto store_sales.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to