[ https://issues.apache.org/jira/browse/HIVE-23684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vineet Garg reassigned HIVE-23684: ---------------------------------- Assignee: Vineet Garg (was: Stamatis Zampetakis) > Large underestimation in NDV stats when input and join cardinality ratio is > big > ------------------------------------------------------------------------------- > > Key: HIVE-23684 > URL: https://issues.apache.org/jira/browse/HIVE-23684 > Project: Hive > Issue Type: Bug > Reporter: Stamatis Zampetakis > Assignee: Vineet Garg > Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Large underestimations of NDV values may occur after a join operation since > the current logic will decrease the original NDV values proportionally. > The > [code|https://github.com/apache/hive/blob/1271d08a3c51c021fa710449f8748b8cdb12b70f/ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java#L2558] > compares the number of rows of each relation before the join with the number > of rows after the join and extracts a ratio for each side. Based on this > ratio it adapts (reduces) the NDV accordingly. > Consider for instance the following query: > {code:sql} > select inv_warehouse_sk > , inv_item_sk > , stddev_samp(inv_quantity_on_hand) stdev > , avg(inv_quantity_on_hand) mean > from inventory > , date_dim > where inv_date_sk = d_date_sk > and d_year = 1999 > and d_moy = 2 > group by inv_warehouse_sk, inv_item_sk; > {code} > For the sake of the discussion, I outline below some relevant stats (from > TPCDS30tb): > T(inventory) = 1627857000 > T(date_dim) = 73049 > T(inventory JOIN date_dim[d_year=1999 AND d_moy=2]) = 24948000 > V(inventory, inv_date_sk) = 261 > V(inventory, inv_item_sk) = 420000 > V(inventory, inv_warehouse_sk) = 27 > V(date_dim, inv, d_date_sk) = 73049 > For instance, in this query the join between inventory and date_dim has ~24M > rows while inventory has ~1.5B so the NDV of the columns coming from > inventory are reduced by a factor of ~100 so we end up with V(JOIN, > inv_item_sk) = ~6K while the real one is 231000. -- This message was sent by Atlassian Jira (v8.3.4#803005)