[ https://issues.apache.org/jira/browse/HIVE-27082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Seonggon Namgung reassigned HIVE-27082: --------------------------------------- Assignee: Seonggon Namgung (was: Sungwoo Park) > AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of > default partition name > ---------------------------------------------------------------------------------------------------- > > Key: HIVE-27082 > URL: https://issues.apache.org/jira/browse/HIVE-27082 > Project: Hive > Issue Type: Improvement > Components: Standalone Metastore > Affects Versions: 3.1.3, 4.0.0-alpha-2 > Reporter: Sungwoo Park > Assignee: Seonggon Namgung > Priority: Major > Labels: pull-request-available > > This JIRA deals with non-determinisitic behavior of Hive in generating DAGs. > The non-determinstic behavior of Hive in generating DAGs is due to the logic > in AggregateStatsCache.findBestMatch() called from AggregateStatsCache.get(), > as well as the disproportionate distribution of Nulls in > HIVE_DEFAULT_PARTITION. > Here is what is happening in the case of the TPC-DS dataset. Let us use > web_sales table and ws_web_site_sk column in the 10TB TPC-DS dataset as a > running example. > In the course of running TPC-DS queries, Hive asks MetaStore about the column > statistics of 1823 partNames in the web_sales/ws_web_site_sk combination, > either without HIVE_DEFAULT_PARTITION or with HIVE_DEFAULT_PARTITION. > --- Without HIVE_DEFAULT_PARTITION, it reports a total of 901180 nulls. > --- With HIVE_DEFAULT_PARTITION, however, it report a total of 1800087 nulls, > almost twice as many. > The first call to MetaStore returns the correct result, but all subsequent > requests are likely to return the same result from the cache, irrespective of > the inclusion of HIVE_DEFAULT_PARTITION. This is because > AggregateStatsCache.findBestMatch() treats HIVE_DEFAULT_PARTITION in the same > way as other partNames, and the difference in the size of partNames[] is just > 1. The outcome depends on the duration of intervening queries, so everything > is now non-deterministic. > If a wrong value of numNulls is returned, Hive generates a different DAG > which make takes much longer than the correct one. The problem is > particularly pronounced here because of the huge number of nulls in > HIVE_DEFAULT_PARTITION. It is ironic to see that the query optimizer is so > efficient that a single wrong guess of numNulls creates a very inefficient > DAG. > Note that this behavior cannot be avoided by setting > hive.metastore.aggregate.stats.cache.max.variance to zero because the > difference in the number of partNames[] between the argument and the entry in > the cache is just 1. > So, AggregateStatsCache.findBestMatch() should treat HIVE_DEFAULT_PARTITION > in a special way, by not returning the result in the cache if there is a > difference in the inclusion of partName HIVE_DEFAULT_PARTITION (or should > provide the use with an option to activate this feature). -- This message was sent by Atlassian Jira (v8.20.10#820010)