[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics

ASF GitHub Bot (Jira) Tue, 06 Dec 2022 08:55:41 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=831491&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831491
 ]


ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Dec/22 16:54
            Start Date: 06/Dec/22 16:54
    Worklog Time Spent: 10m 
      Work Description: asolimando commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1041223672


##########
standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/StatisticsTestUtils.java:
##########
@@ -109,4 +135,116 @@ public static HyperLogLog createHll(String... values) {
     }
     return hll;
   }
+
+  /**
+   * Creates an HLL object initialized with the given values.
+   * @param values the values to be added
+   * @return an HLL object initialized with the given values.
+   */
+  public static HyperLogLog createHll(double... values) {
+    HyperLogLog hll = HyperLogLog.builder().build();
+    Arrays.stream(values).forEach(hll::addDouble);
+    return hll;
+  }
+
+  /**
+   * Creates a KLL object initialized with the given values.
+   * @param values the values to be added
+   * @return a KLL object initialized with the given values.
+   */
+  public static KllFloatsSketch createKll(float... values) {
+    KllFloatsSketch kll = new KllFloatsSketch();
+    for (float value : values) {
+      kll.update(value);
+    }
+    return kll;
+  }
+
+  /**
+   * Creates a KLL object initialized with the given values.
+   * @param values the values to be added
+   * @return a KLL object initialized with the given values.
+   */
+  public static KllFloatsSketch createKll(double... values) {
+    KllFloatsSketch kll = new KllFloatsSketch();
+    for (double value : values) {
+      kll.update(Double.valueOf(value).floatValue());
+    }
+    return kll;
+  }
+
+  /**
+   * Creates a KLL object initialized with the given values.
+   * @param values the values to be added
+   * @return a KLL object initialized with the given values.
+   */
+  public static KllFloatsSketch createKll(long... values) {
+    KllFloatsSketch kll = new KllFloatsSketch();
+    for (long value : values) {
+      kll.update(value);
+    }
+    return kll;
+  }
+
+  /**
+   * Checks if expected and computed statistics data are equal.
+   * @param expected expected statistics data
+   * @param computed computed statistics data
+   */
+  public static void assertEqualStatistics(ColumnStatisticsData expected, ColumnStatisticsData computed) {
+    if (expected.getSetField() != computed.getSetField()) {
+      throw new IllegalArgumentException("Expected data is of type " + expected.getSetField()
+          + " while computed data is of type " + computed.getSetField());
+    }
+
+    Class<?> dataClass = null;
+    switch (expected.getSetField()) {
+    case DATE_STATS:
+      dataClass = DateColumnStatsData.class;
+      break;
+    case LONG_STATS:
+      dataClass = LongColumnStatsData.class;
+      break;
+    case DOUBLE_STATS:
+      dataClass = DoubleColumnStatsData.class;
+      break;
+    case DECIMAL_STATS:
+      dataClass = DecimalColumnStatsData.class;
+      break;
+    case TIMESTAMP_STATS:
+      dataClass = TimestampColumnStatsData.class;
+      break;
+    default:
+      // it's an unsupported class for KLL, no special treatment needed
+      Assert.assertEquals(expected, computed);
+      return;
+    }
+    assertEqualStatistics(expected, computed, dataClass);
+  }
+
+  private static <X> void assertEqualStatistics(

Review Comment:
   Unfortunately, `KllFloatsSketch`'s serialization to bytes is sensitive to 
data insertion order (and therefore to sketch merging order). Even if two 
sketches are functionally the same (the cumulative distribution function and 
every other method return identical results), the two byte arrays will 
differ.
   
   This method exists only to work around that: it compares the sketches via 
their string representation, which is stable w.r.t. insertion order and 
merging order.
   
   Please note that we still call `Assert.assertEquals(expected, computed)`. 
We do so after comparing the string representations of the two sketches and 
setting them equal (we restore them afterwards so as not to alter the input), 
so everything is fully checked.
   
   Your comment made me notice that we were not checking anything if one of the 
two inputs had no histogram; I have handled that case now.
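
   To illustrate the approach described in this comment, here is a minimal, 
self-contained sketch. It does not use the Hive or DataSketches classes: 
`ToyStats`, `serialize`, `summary` and `assertEqualStats` are hypothetical 
stand-ins, and the "sketch" is just a list of floats whose byte form depends on 
insertion order while its text form (the sorted values) does not. The real 
helper in the PR may differ in its details.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Objects;

final class SketchComparisonDemo {

  /** Hypothetical stats holder: a null count plus serialized histogram bytes (may be null). */
  static final class ToyStats {
    final long numNulls;
    byte[] histogram;

    ToyStats(long numNulls, byte[] histogram) {
      this.numNulls = numNulls;
      this.histogram = histogram;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof ToyStats)) {
        return false;
      }
      ToyStats other = (ToyStats) o;
      return numNulls == other.numNulls && Arrays.equals(histogram, other.histogram);
    }

    @Override
    public int hashCode() {
      return Objects.hash(numNulls, Arrays.hashCode(histogram));
    }
  }

  /** Toy serialization: floats are written in insertion order, so the bytes are order-sensitive. */
  static byte[] serialize(float... values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Float.BYTES);
    for (float v : values) {
      buf.putFloat(v);
    }
    return buf.array();
  }

  /** Order-insensitive text form of the toy sketch: its values, sorted. */
  static String summary(byte[] bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes);
    float[] values = new float[bytes.length / Float.BYTES];
    for (int i = 0; i < values.length; i++) {
      values[i] = buf.getFloat();
    }
    Arrays.sort(values);
    return Arrays.toString(values);
  }

  /** Compares histograms by text form, then runs the full equality check with the bytes aligned. */
  static void assertEqualStats(ToyStats expected, ToyStats computed) {
    byte[] original = computed.histogram;
    if (expected.histogram == null || original == null) {
      // A histogram present on only one side is a real difference.
      if (expected.histogram != original) {
        throw new AssertionError("Only one input has a histogram");
      }
    } else if (!summary(expected.histogram).equals(summary(original))) {
      throw new AssertionError("Histograms differ");
    }
    try {
      // Neutralize the order-sensitive bytes, then compare every other field.
      computed.histogram = expected.histogram;
      if (!expected.equals(computed)) {
        throw new AssertionError("Statistics differ");
      }
    } finally {
      // Restore the input so the caller's object is left unchanged.
      computed.histogram = original;
    }
  }

  public static void main(String[] args) {
    ToyStats expected = new ToyStats(2, serialize(1f, 2f, 3f));
    ToyStats computed = new ToyStats(2, serialize(3f, 1f, 2f));
    // Same values, different insertion order: the raw bytes are not equal.
    System.out.println(Arrays.equals(expected.histogram, computed.histogram));
    // The order-insensitive comparison still passes.
    assertEqualStats(expected, computed);
  }
}
```

   The `finally` block mirrors the "restore them back" step: the comparison 
temporarily overwrites one field, so the original bytes are put back even when 
the equality check fails.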





Issue Time Tracking
-------------------

    Worklog Id:     (was: 831491)
    Time Spent: 4h 50m  (was: 4h 40m)

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for 
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a 
> hard-coded value of 1/3 (see 
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histogram as an additional column 
> statistics, stored into the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
>  * efficiency: the approach must scale and support billions of rows
>  * merge-ability: partition-level histograms have to be merged to form 
> table-level histograms
>  * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates [KLL data 
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. 
> Datasketches are small, stateful programs that process massive data-streams 
> and can provide approximate answers, with mathematical guarantees, to 
> computationally difficult queries orders-of-magnitude faster than 
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution 
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric 
> families) and temporal data types (date and timestamp).
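
As a rough illustration of why a CDF helps with range predicates on skewed data, 
the following sketch compares a fixed 1/3 estimate with an estimate derived from 
a CDF. It is not Hive code: `RangeSelectivityDemo` and its methods are 
hypothetical, and an exact empirical CDF over a sorted array stands in for the 
approximate CDF that a KLL sketch would provide.

```java
import java.util.Arrays;

final class RangeSelectivityDemo {

  /** The fixed fallback estimate for range predicates mentioned in the description. */
  static final double DEFAULT_RANGE_SELECTIVITY = 1.0 / 3.0;

  /** Fraction of values strictly below x; sortedValues must be non-empty and sorted ascending. */
  static double cdf(double[] sortedValues, double x) {
    if (sortedValues.length == 0) {
      throw new IllegalArgumentException("no values");
    }
    int lo = 0;
    int hi = sortedValues.length;
    // Binary search for the first index whose value is >= x.
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sortedValues[mid] < x) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return (double) lo / sortedValues.length;
  }

  /** Estimated selectivity of the predicate low <= col < high. */
  static double selectivityBetween(double[] sortedValues, double low, double high) {
    return cdf(sortedValues, high) - cdf(sortedValues, low);
  }

  /** Skewed sample: 90 rows with value 1.0, then 10, 20, ..., 100. */
  static double[] skewedSample() {
    double[] data = new double[100];
    Arrays.fill(data, 0, 90, 1.0);
    for (int i = 0; i < 10; i++) {
      data[90 + i] = 10.0 * (i + 1);
    }
    Arrays.sort(data);
    return data;
  }

  public static void main(String[] args) {
    double[] data = skewedSample();
    System.out.println("fixed estimate:       " + DEFAULT_RANGE_SELECTIVITY);
    System.out.println("col < 5 via CDF:      " + cdf(data, 5.0));
    System.out.println("5 <= col < 55 via CDF: " + selectivityBetween(data, 5.0, 55.0));
  }
}
```

On this sample, 90% of the rows satisfy `col < 5` and about 5% satisfy 
`5 <= col < 55`, so a single fixed value cannot be close to both.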



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
