[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=831286&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831286 ]
ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Dec/22 07:28
            Start Date: 06/Dec/22 07:28
    Worklog Time Spent: 10m
      Work Description: dengzhhu653 commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1040592119


##########
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/StatisticsTestUtils.java:
##########
@@ -109,4 +135,116 @@ public static HyperLogLog createHll(String... values) {
     }
     return hll;
   }
+
+  /**
+   * Creates an HLL object initialized with the given values.
+   * @param values the values to be added
+   * @return an HLL object initialized with the given values.
+   */
+  public static HyperLogLog createHll(double... values) {
+    HyperLogLog hll = HyperLogLog.builder().build();
+    Arrays.stream(values).forEach(hll::addDouble);
+    return hll;
+  }
+
+  /**
+   * Creates a KLL object initialized with the given values.
+   * @param values the values to be added
+   * @return a KLL object initialized with the given values.
+   */
+  public static KllFloatsSketch createKll(float... values) {
+    KllFloatsSketch kll = new KllFloatsSketch();
+    for (float value : values) {
+      kll.update(value);
+    }
+    return kll;
+  }
+
+  /**
+   * Creates a KLL object initialized with the given values.
+   * @param values the values to be added
+   * @return a KLL object initialized with the given values.
+   */
+  public static KllFloatsSketch createKll(double... values) {
+    KllFloatsSketch kll = new KllFloatsSketch();
+    for (double value : values) {
+      kll.update(Double.valueOf(value).floatValue());
+    }
+    return kll;
+  }
+
+  /**
+   * Creates a KLL object initialized with the given values.
+   * @param values the values to be added
+   * @return a KLL object initialized with the given values.
+   */
+  public static KllFloatsSketch createKll(long... values) {
+    KllFloatsSketch kll = new KllFloatsSketch();
+    for (long value : values) {
+      kll.update(value);
+    }
+    return kll;
+  }
+
+  /**
+   * Checks if expected and computed statistics data are equal.
+   * @param expected expected statistics data
+   * @param computed computed statistics data
+   */
+  public static void assertEqualStatistics(ColumnStatisticsData expected, ColumnStatisticsData computed) {
+    if (expected.getSetField() != computed.getSetField()) {
+      throw new IllegalArgumentException("Expected data is of type " + expected.getSetField() +
+          " while computed data is of type " + computed.getSetField());
+    }
+
+    Class<?> dataClass = null;
+    switch (expected.getSetField()) {
+    case DATE_STATS:
+      dataClass = DateColumnStatsData.class;
+      break;
+    case LONG_STATS:
+      dataClass = LongColumnStatsData.class;
+      break;
+    case DOUBLE_STATS:
+      dataClass = DoubleColumnStatsData.class;
+      break;
+    case DECIMAL_STATS:
+      dataClass = DecimalColumnStatsData.class;
+      break;
+    case TIMESTAMP_STATS:
+      dataClass = TimestampColumnStatsData.class;
+      break;
+    default:
+      // it's an unsupported class for KLL, no special treatment needed
+      Assert.assertEquals(expected, computed);
+      return;
+    }
+    assertEqualStatistics(expected, computed, dataClass);
+  }
+
+  private static <X> void assertEqualStatistics(

Review Comment:
   This function only compares the `histogram`, and does not tell us much when either `computedHasHistograms` or `expectedHasHistograms` is false.
   Could we compare the `ColumnStatisticsData` by `Assert.assertEquals(expected, computed);` as we did in Line 219?


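To illustrate the concern raised in the review comment, here is a minimal, self-contained sketch. The body of the private `assertEqualStatistics` overload is not visible in the hunk above, so this assumes it skips the comparison when either side has no histogram. `Stats`, `histogramOnlyEquals`, and `fullEquals` are hypothetical names for illustration, not Hive's `ColumnStatisticsData` API.

```java
import java.util.Arrays;
import java.util.Objects;

final class StatsComparisonSketch {

  /** Hypothetical stand-in for a column statistics object (not a Hive class). */
  static final class Stats {
    final long numNulls;
    final long numDVs;
    final byte[] histogram; // null when no histogram was computed

    Stats(long numNulls, long numDVs, byte[] histogram) {
      this.numNulls = numNulls;
      this.numDVs = numDVs;
      this.histogram = histogram;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Stats)) {
        return false;
      }
      Stats other = (Stats) o;
      return numNulls == other.numNulls
          && numDVs == other.numDVs
          && Arrays.equals(histogram, other.histogram);
    }

    @Override
    public int hashCode() {
      return 31 * Objects.hash(numNulls, numDVs) + Arrays.hashCode(histogram);
    }
  }

  /** Assumed behavior under review: compares histograms only, and only if both exist. */
  static boolean histogramOnlyEquals(Stats expected, Stats computed) {
    if (expected.histogram == null || computed.histogram == null) {
      return true; // nothing is compared, so any difference in other fields is missed
    }
    return Arrays.equals(expected.histogram, computed.histogram);
  }

  /** The reviewer's suggestion, in spirit: compare the whole object. */
  static boolean fullEquals(Stats expected, Stats computed) {
    return expected.equals(computed);
  }

  public static void main(String[] args) {
    Stats expected = new Stats(0, 5, null);
    Stats computed = new Stats(3, 9, null);
    System.out.println("histogram-only: " + histogramOnlyEquals(expected, computed));
    System.out.println("full equality:  " + fullEquals(expected, computed));
  }
}
```

Under that assumption, two objects that differ in every non-histogram field still pass the histogram-only check, which is the gap the comment points at.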
Issue Time Tracking
-------------------

    Worklog Id:     (was: 831286)
    Time Spent: 4.5h  (was: 4h 20m)

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a
> hard-coded value of 1/3 (see
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
> * efficiency: the approach must scale and support billions of rows
> * merge-ability: partition-level histograms have to be merged to form
> table-level histograms
> * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates the [KLL data
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF.
> Datasketches are small, stateful programs that process massive data-streams
> and can provide approximate answers, with mathematical guarantees, to
> computationally difficult queries orders-of-magnitude faster than
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric
> families) and temporal data types (date and timestamp).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
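
As a rough illustration of why a CDF helps with range predicates: the fraction of rows satisfying `lower < col <= upper` is CDF(upper) - CDF(lower), which can replace a fixed guess such as 1/3. The sketch below uses an exact empirical CDF over a small in-memory array as a stand-in for a KLL sketch; it does not use the DataSketches library, and `RangeSelectivitySketch`, `cdf`, and `selectivity` are hypothetical names rather than anything from the Hive patch.

```java
import java.util.Arrays;

final class RangeSelectivitySketch {

  private final double[] sorted;

  /** Keeps a sorted copy of the sample; a real KLL sketch would keep a compact summary instead. */
  public RangeSelectivitySketch(double... values) {
    this.sorted = values.clone();
    Arrays.sort(this.sorted);
  }

  /** Empirical CDF: fraction of values that are <= x. */
  public double cdf(double x) {
    if (sorted.length == 0) {
      return 0.0;
    }
    // Binary search for the number of elements <= x.
    int lo = 0;
    int hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] <= x) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return (double) lo / sorted.length;
  }

  /** Estimated selectivity of the predicate lower < col <= upper. */
  public double selectivity(double lower, double upper) {
    if (upper <= lower) {
      return 0.0;
    }
    return cdf(upper) - cdf(lower);
  }

  public static void main(String[] args) {
    // Skewed sample: most values are 1, with a few large outliers.
    RangeSelectivitySketch skewed =
        new RangeSelectivitySketch(1, 1, 1, 1, 1, 1, 1, 1, 50, 100);
    System.out.println("selectivity of 0 < col <= 1:    " + skewed.selectivity(0, 1));
    System.out.println("selectivity of 10 < col <= 100: " + skewed.selectivity(10, 100));
  }
}
```

On skewed data like the sample in `main`, the two ranges get very different estimates, whereas a constant selectivity would treat them identically.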