[ 
https://issues.apache.org/jira/browse/HIVE-23684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vineet Garg resolved HIVE-23684.
--------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

> Large underestimation in NDV stats when input and join cardinality ratio is 
> big
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-23684
>                 URL: https://issues.apache.org/jira/browse/HIVE-23684
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Stamatis Zampetakis
>            Assignee: Vineet Garg
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Large underestimations of NDV values may occur after a join operation since 
> the current logic will decrease the original NDV values proportionally.
> The 
> [code|https://github.com/apache/hive/blob/1271d08a3c51c021fa710449f8748b8cdb12b70f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2558]
>  compares the number of rows of each relation before the join with the number 
> of rows after the join and extracts a ratio for each side. Based on this 
> ratio it adapts (reduces) the NDV accordingly.
> Consider for instance the following query:
> {code:sql}
> select inv_warehouse_sk
>      , inv_item_sk
>      , stddev_samp(inv_quantity_on_hand) stdev
>      , avg(inv_quantity_on_hand)         mean
> from inventory
>    , date_dim
> where inv_date_sk = d_date_sk
>   and d_year = 1999
>   and d_moy = 2
> group by inv_warehouse_sk, inv_item_sk;
> {code}
> For the sake of the discussion, I outline below some relevant stats (from 
> TPCDS30tb):
>  T(inventory) = 1627857000
>  T(date_dim) = 73049
>  T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000
>  V(inventory, inv_date_sk) = 261
>  V(inventory, inv_item_sk) = 420000
>  V(inventory, inv_warehouse_sk) = 27
>  V(date_dim, inv, d_date_sk) = 73049
> For instance, in this query the join between inventory and date_dim has ~24M 
> rows while inventory has ~1.5B so the NDV of the columns coming from 
> inventory are reduced by a factor of ~100 so we end up with V(JOIN, 
> inv_item_sk) = ~6K while the real one is 231000.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to