[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=830981&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830981 ]
ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Dec/22 10:14
            Start Date: 05/Dec/22 10:14
    Worklog Time Spent: 10m
      Work Description: asolimando commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1039397581


##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/LongColumnStatsAggregator.java:
##########

@@ -51,73 +52,94 @@ public ColumnStatisticsObj aggregate(List<ColStatsObjWithSourceInfo> colStatsWit
     checkStatisticsList(colStatsWithSourceInfo);
     ColumnStatisticsObj statsObj = null;
-    String colType = null;
+    String colType;
     String colName = null;
     // check if all the ColumnStatisticsObjs contain stats and all the ndv are
     // bitvectors
     boolean doAllPartitionContainStats = partNames.size() == colStatsWithSourceInfo.size();
     NumDistinctValueEstimator ndvEstimator = null;
+    KllHistogramEstimator histogramEstimator = null;
+    boolean areAllNDVEstimatorsMergeable = true;
+    boolean areAllHistogramEstimatorsMergeable = true;
     for (ColStatsObjWithSourceInfo csp : colStatsWithSourceInfo) {
       ColumnStatisticsObj cso = csp.getColStatsObj();
       if (statsObj == null) {
         colName = cso.getColName();
         colType = cso.getColType();
         statsObj = ColumnStatsAggregatorFactory.newColumnStaticsObj(colName, colType,
             cso.getStatsData().getSetField());
-        LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName,
-            doAllPartitionContainStats);
+        LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName, doAllPartitionContainStats);
       }
-      LongColumnStatsDataInspector longColumnStatsData = longInspectorFromStats(cso);
-      if (longColumnStatsData.getNdvEstimator() == null) {
-        ndvEstimator = null;
-        break;
-      } else {
-        // check if all of the bit vectors can merge
-        NumDistinctValueEstimator estimator = longColumnStatsData.getNdvEstimator();
+      LongColumnStatsDataInspector columnStatsData = longInspectorFromStats(cso);
+
+      // check if we can merge NDV estimators
+      if (columnStatsData.getNdvEstimator() == null) {
+        areAllNDVEstimatorsMergeable = false;
+      } else if (areAllNDVEstimatorsMergeable) {
+        NumDistinctValueEstimator estimator = columnStatsData.getNdvEstimator();
         if (ndvEstimator == null) {
           ndvEstimator = estimator;
         } else {
-          if (ndvEstimator.canMerge(estimator)) {
-            continue;
-          } else {
-            ndvEstimator = null;
-            break;
+          if (!ndvEstimator.canMerge(estimator)) {
+            areAllNDVEstimatorsMergeable = false;
+          }
+        }
+      }
+      // check if we can merge histogram estimators
+      if (columnStatsData.getHistogramEstimator() == null) {

Review Comment:
   You are right, I have double checked and indeed it can be simplified as you suggest; I have added those four lines you cited right before the `return` statement of the `aggregate` method and that is enough.


Issue Time Tracking
-------------------

    Worklog Id:     (was: 830981)
    Time Spent: 3.5h  (was: 3h 20m)

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a
> hard-coded value of 1/3 (see
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for the histogram integration are the following:
> * efficiency: the approach must scale and support billions of rows
> * merge-ability: partition-level histograms have to be merged to form
> table-level histograms
> * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates a [KLL data
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. Data
> sketches are small, stateful programs that process massive data streams and
> can provide approximate answers, with mathematical guarantees, to
> computationally difficult queries orders of magnitude faster than
> traditional, exact methods.
> We propose to use KLL, and more specifically its cumulative distribution
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric
> families) and temporal data types (date and timestamp).
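
To illustrate the merge-ability requirement described above (and the partition-to-table aggregation pattern that the diff in this worklog touches), here is a minimal, self-contained sketch of building per-partition KLL sketches and merging them into a table-level one. It assumes the datasketches-java `KllFloatsSketch` API (`newHeapInstance`, `update`, `merge`, `getN`, `toByteArray`); the class name and the sample data are purely illustrative and are not part of Hive's actual integration.

```java
import org.apache.datasketches.kll.KllFloatsSketch;

// Hypothetical example class, not Hive code: it only illustrates how
// partition-level sketches merge into a table-level sketch.
public class KllMergeExample {

  public static void main(String[] args) {
    // k controls the explicit memory/accuracy trade-off: a larger k means a
    // bigger sketch and a tighter rank-error bound.
    final int k = 200;

    // One sketch per partition, fed while scanning that partition's rows.
    KllFloatsSketch partition1 = KllFloatsSketch.newHeapInstance(k);
    KllFloatsSketch partition2 = KllFloatsSketch.newHeapInstance(k);
    for (int i = 0; i < 1_000_000; i++) {
      partition1.update(i % 100);  // roughly uniform values 0..99
      partition2.update(i % 10);   // skewed towards small values 0..9
    }

    // Partition-level sketches merge into a table-level sketch without
    // rescanning the data, which is what makes table-level histograms cheap.
    KllFloatsSketch table = KllFloatsSketch.newHeapInstance(k);
    table.merge(partition1);
    table.merge(partition2);

    System.out.println("rows summarized: " + table.getN());
    System.out.println("serialized size (bytes): " + table.toByteArray().length);
  }
}
```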
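
Along the same lines, a hedged sketch of how a range-predicate selectivity could be read off the sketch's CDF instead of the hard-coded 1/3; again the class, helper method and data are illustrative assumptions, with `getCDF` being the datasketches-java call mentioned in the description (boundary inclusion at the split points depends on the library version's default search criteria).

```java
import org.apache.datasketches.kll.KllFloatsSketch;

// Hypothetical example class, not Hive code: estimates the selectivity of a
// range predicate on a column from the CDF exposed by a KLL sketch.
public class CdfSelectivityExample {

  /** Approximate fraction of rows with lo < col <= hi (up to boundary semantics). */
  static double rangeSelectivity(KllFloatsSketch sketch, float lo, float hi) {
    // getCDF returns the cumulative rank at each split point plus a trailing 1.0,
    // so the mass between lo and hi is the difference of the first two entries.
    double[] cdf = sketch.getCDF(new float[] {lo, hi});
    return cdf[1] - cdf[0];
  }

  public static void main(String[] args) {
    KllFloatsSketch sketch = KllFloatsSketch.newHeapInstance();
    // Heavily skewed column: 90% of the values fall in 1..9, 10% are outliers.
    for (int i = 0; i < 1_000_000; i++) {
      sketch.update(i % 10 == 0 ? 500f : i % 10);
    }

    double hardCoded = 1.0 / 3.0;                       // current heuristic
    double fromCdf = rangeSelectivity(sketch, 0f, 9f);  // histogram-based estimate

    System.out.printf("hard-coded selectivity: %.2f, CDF-based estimate: %.2f%n",
        hardCoded, fromCdf);
  }
}
```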