[ https://issues.apache.org/jira/browse/HIVE-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mostafa Mokhtar updated HIVE-7913: ---------------------------------- Assignee: Laljo John Pullokkaran > Simplify predicates for CBO > --------------------------- > > Key: HIVE-7913 > URL: https://issues.apache.org/jira/browse/HIVE-7913 > Project: Hive > Issue Type: Bug > Components: CBO > Affects Versions: 0.13.1 > Reporter: Mostafa Mokhtar > Assignee: Laljo John Pullokkaran > Fix For: 0.14.0 > > > I noticed that the estimate number of rows in Map joins is higher after the > join than before the join that is with column stats fetch ON or OFF. > TPC-DS Q55 was a good example for that, the issue is that the current > statistics provide us enough information that we can estimate with strong > confidence that the joins are one to many and not many to many. > Joining store_sales x item on ss_item_sk = i_item_sk, we know that the NDV, > min and max values for both join columns match while the row counts are > different this pattern indicates a PK/FK relationship between store_sales and > item. > Yet when a filter is applied on item and reduces the number of rows from 462K > to 7K we estimate a many to many join between the filtered item and > store_sales and as a result the estimate number of rows coming out of the > join is off by several orders of magnitude. > Available information from the stats > {code} > Table Join column NDV from describe NDV actual > min max > item i_item_sk 439,501 462,000 > 1 462,000 > date_dim d_date_sk 65,332 73,049 > 2,415,022 2,488,070 > store_sales ss_item_sk 439,501 462,000 > 1 462,000 > store_sales ss_sold_date_sk 2,226 1,823 > 2,450,816 2,452,642 > {code} > Same thing applies to store_sales and date_dim but with a caveat that the NDV > , min and max values don't match where date_dim has a bigger domain and > accordingly a higher NDV count. > For joining store_sales and item on on ss_item_sk = i_item_sk since both > columns have the same NDV, min and max values we can safely conclude that > selectivity on item will translate to similar selectivity on store_sales. > This is not the case for joining store_sales and date_dim on ss_sold_date_sk > = d_date_sk since the domain of d_date_sk is much bigger than that of > ss_sold_date_sk, differences in domain need to be taken into account when > inferring selectivity onto store_sales. -- This message was sent by Atlassian JIRA (v6.2#6252)