[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics

ASF GitHub Bot (Jira) Thu, 08 Dec 2022 18:20:49 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=832215&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-832215
 ]


ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 09/Dec/22 02:19
            Start Date: 09/Dec/22 02:19
    Worklog Time Spent: 10m 
      Work Description: dengzhhu653 commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1044002365


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java:
##########
@@ -167,6 +178,109 @@ public Double visitCall(RexCall call) {
     return selectivity;
   }
 
+  private double computeRangePredicateSelectivity(RexCall call, SqlKind op) {
+    final boolean isLiteralLeft = 
call.getOperands().get(0).getKind().equals(SqlKind.LITERAL);
+    final boolean isLiteralRight = 
call.getOperands().get(1).getKind().equals(SqlKind.LITERAL);
+    final boolean isInputRefLeft = 
call.getOperands().get(0).getKind().equals(SqlKind.INPUT_REF);
+    final boolean isInputRefRight = 
call.getOperands().get(1).getKind().equals(SqlKind.INPUT_REF);
+
+    if (childRel instanceof HiveTableScan && isLiteralLeft != isLiteralRight 
&& isInputRefLeft != isInputRefRight) {
+      final HiveTableScan t = (HiveTableScan) childRel;
+      final int inputRefIndex = ((RexInputRef) 
call.getOperands().get(isInputRefLeft ? 0 : 1)).getIndex();
+      final List<ColStatistics> colStats = 
t.getColStat(Collections.singletonList(inputRefIndex));
+
+      if (!colStats.isEmpty() && isHistogramAvailable(colStats.get(0))) {
+        final KllFloatsSketch kll = 
KllFloatsSketch.heapify(Memory.wrap(colStats.get(0).getHistogram()));
+        final Object boundValueObject = ((RexLiteral) 
call.getOperands().get(isLiteralLeft ? 0 : 1)).getValue();
+        final SqlTypeName typeName = call.getOperands().get(isInputRefLeft ? 0 
: 1).getType().getSqlTypeName();
+        float value = extractLiteral(typeName, boundValueObject);
+        boolean closedBound = op.equals(SqlKind.LESS_THAN_OR_EQUAL) || 
op.equals(SqlKind.GREATER_THAN_OR_EQUAL);
+
+        double selectivity;
+        if (op.equals(SqlKind.LESS_THAN_OR_EQUAL) || 
op.equals(SqlKind.LESS_THAN)) {
+          selectivity = closedBound ? lessThanOrEqualSelectivity(kll, value) : 
lessThanSelectivity(kll, value);
+        } else {
+          selectivity = closedBound ? greaterThanOrEqualSelectivity(kll, 
value) : greaterThanSelectivity(kll, value);
+        }
+
+        // selectivity does not account for null values, we multiply for the 
number of non-null values (getN) and we
+        // divide by the total (non-null + null values) to get the overall 
selectivity
+        return kll.getN() * selectivity / t.getTable().getRowCount();

Review Comment:
   I mean for a filter a < 3, if in table there are some rows for column a:
   1
   null
   null
   3
   4
   after filter the are only one row left, so the filter selectivity is 1/5, 
not (1+2(null))/5.





Issue Time Tracking
-------------------

    Worklog Id:     (was: 832215)
    Time Spent: 7.5h  (was: 7h 20m)

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 7.5h
>  Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for 
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a 
> hard-coded value of 1/3 (see 
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).])
> The current proposal aims at integrating histogram as an additional column 
> statistics, stored into the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
>  * efficiency: the approach must scale and support billions of rows
>  * merge-ability: partition-level histograms have to be merged to form 
> table-level histograms
>  * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates [KLL data 
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. 
> Datasketches are small, stateful programs that process massive data-streams 
> and can provide approximate answers, with mathematical guarantees, to 
> computationally difficult queries orders-of-magnitude faster than 
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution 
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric 
> families) and temporal data types (date and timestamp).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Work logged] (HIVE-26221) Add histogram-based column statistics

Reply via email to