[ 
https://issues.apache.org/jira/browse/HIVE-23684?focusedWorklogId=535539&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-535539
 ]

ASF GitHub Bot logged work on HIVE-23684:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Jan/21 16:46
            Start Date: 13/Jan/21 16:46
    Worklog Time Spent: 10m 
      Work Description: vineetgarg02 merged pull request #1786:
URL: https://github.com/apache/hive/pull/1786


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 535539)
    Time Spent: 50m  (was: 40m)

> Large underestimation in NDV stats when input and join cardinality ratio is 
> big
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-23684
>                 URL: https://issues.apache.org/jira/browse/HIVE-23684
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Large underestimations of NDV values may occur after a join operation since 
> the current logic will decrease the original NDV values proportionally.
> The 
> [code|https://github.com/apache/hive/blob/1271d08a3c51c021fa710449f8748b8cdb12b70f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2558]
>  compares the number of rows of each relation before the join with the number 
> of rows after the join and extracts a ratio for each side. Based on this 
> ratio it adapts (reduces) the NDV accordingly.
> Consider for instance the following query:
> {code:sql}
> select inv_warehouse_sk
>      , inv_item_sk
>      , stddev_samp(inv_quantity_on_hand) stdev
>      , avg(inv_quantity_on_hand)         mean
> from inventory
>    , date_dim
> where inv_date_sk = d_date_sk
>   and d_year = 1999
>   and d_moy = 2
> group by inv_warehouse_sk, inv_item_sk;
> {code}
> For the sake of the discussion, I outline below some relevant stats (from 
> TPCDS30tb):
>  T(inventory) = 1627857000
>  T(date_dim) = 73049
>  T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000
>  V(inventory, inv_date_sk) = 261
>  V(inventory, inv_item_sk) = 420000
>  V(inventory, inv_warehouse_sk) = 27
>  V(date_dim, inv, d_date_sk) = 73049
> For instance, in this query the join between inventory and date_dim has ~24M 
> rows while inventory has ~1.5B so the NDV of the columns coming from 
> inventory are reduced by a factor of ~100 so we end up with V(JOIN, 
> inv_item_sk) = ~6K while the real one is 231000.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to