[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=830981&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-830981 ]
ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Dec/22 10:14
            Start Date: 05/Dec/22 10:14
    Worklog Time Spent: 10m
      Work Description: asolimando commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1039397581


##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/columnstats/aggr/LongColumnStatsAggregator.java:
##########

@@ -51,73 +52,94 @@ public ColumnStatisticsObj aggregate(List<ColStatsObjWithSourceInfo> colStatsWit
     checkStatisticsList(colStatsWithSourceInfo);
     ColumnStatisticsObj statsObj = null;
-    String colType = null;
+    String colType;
     String colName = null;
     // check if all the ColumnStatisticsObjs contain stats and all the ndv are
     // bitvectors
     boolean doAllPartitionContainStats = partNames.size() == colStatsWithSourceInfo.size();
     NumDistinctValueEstimator ndvEstimator = null;
+    KllHistogramEstimator histogramEstimator = null;
+    boolean areAllNDVEstimatorsMergeable = true;
+    boolean areAllHistogramEstimatorsMergeable = true;
     for (ColStatsObjWithSourceInfo csp : colStatsWithSourceInfo) {
       ColumnStatisticsObj cso = csp.getColStatsObj();
       if (statsObj == null) {
         colName = cso.getColName();
         colType = cso.getColType();
         statsObj = ColumnStatsAggregatorFactory.newColumnStaticsObj(colName, colType,
             cso.getStatsData().getSetField());
-        LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName,
-            doAllPartitionContainStats);
+        LOG.trace("doAllPartitionContainStats for column: {} is: {}", colName, doAllPartitionContainStats);
       }
-      LongColumnStatsDataInspector longColumnStatsData = longInspectorFromStats(cso);
-      if (longColumnStatsData.getNdvEstimator() == null) {
-        ndvEstimator = null;
-        break;
-      } else {
-        // check if all of the bit vectors can merge
-        NumDistinctValueEstimator estimator = longColumnStatsData.getNdvEstimator();
+      LongColumnStatsDataInspector columnStatsData = longInspectorFromStats(cso);
+
+      // check if we can merge NDV estimators
+      if (columnStatsData.getNdvEstimator() == null) {
+        areAllNDVEstimatorsMergeable = false;
+      } else if (areAllNDVEstimatorsMergeable) {
+        NumDistinctValueEstimator estimator = columnStatsData.getNdvEstimator();
         if (ndvEstimator == null) {
           ndvEstimator = estimator;
         } else {
-          if (ndvEstimator.canMerge(estimator)) {
-            continue;
-          } else {
-            ndvEstimator = null;
-            break;
+          if (!ndvEstimator.canMerge(estimator)) {
+            areAllNDVEstimatorsMergeable = false;
+          }
+        }
+      }
+      // check if we can merge histogram estimators
+      if (columnStatsData.getHistogramEstimator() == null) {

Review Comment:
   You are right, I have double checked and indeed it can be simplified as you suggest; I have added those four lines you cited right before the `return` statement of the `aggregate` method and that is enough.


Issue Time Tracking
-------------------

    Worklog Id:     (was: 830981)
    Time Spent: 3.5h  (was: 3h 20m)

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a
> hard-coded value of 1/3 (see
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for the histogram integration are the following:
> * efficiency: the approach must scale and support billions of rows
> * merge-ability: partition-level histograms have to be merged to form
> table-level histograms
> * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates a [KLL data
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. Data
> sketches are small, stateful programs that process massive data streams and
> can provide approximate answers, with mathematical guarantees, to
> computationally difficult queries orders of magnitude faster than
> traditional, exact methods.
> We propose to use KLL, and more specifically its cumulative distribution
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric
> families) and temporal data types (date and timestamp).
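
To illustrate the merge-ability requirement described above (and the partition-to-table aggregation pattern that the diff in this worklog touches), here is a minimal, self-contained sketch of building per-partition KLL sketches and merging them into a table-level one. It assumes the datasketches-java `KllFloatsSketch` API (`newHeapInstance`, `update`, `merge`, `getN`, `toByteArray`); the class name and the sample data are purely illustrative and are not part of Hive's actual integration.

```java
import org.apache.datasketches.kll.KllFloatsSketch;

// Hypothetical example class, not Hive code: it only illustrates how
// partition-level sketches merge into a table-level sketch.
public class KllMergeExample {

  public static void main(String[] args) {
    // k controls the explicit memory/accuracy trade-off: a larger k means a
    // bigger sketch and a tighter rank-error bound.
    final int k = 200;

    // One sketch per partition, fed while scanning that partition's rows.
    KllFloatsSketch partition1 = KllFloatsSketch.newHeapInstance(k);
    KllFloatsSketch partition2 = KllFloatsSketch.newHeapInstance(k);
    for (int i = 0; i < 1_000_000; i++) {
      partition1.update(i % 100);  // roughly uniform values 0..99
      partition2.update(i % 10);   // skewed towards small values 0..9
    }

    // Partition-level sketches merge into a table-level sketch without
    // rescanning the data, which is what makes table-level histograms cheap.
    KllFloatsSketch table = KllFloatsSketch.newHeapInstance(k);
    table.merge(partition1);
    table.merge(partition2);

    System.out.println("rows summarized: " + table.getN());
    System.out.println("serialized size (bytes): " + table.toByteArray().length);
  }
}
```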
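
Along the same lines, a hedged sketch of how a range-predicate selectivity could be read off the sketch's CDF instead of the hard-coded 1/3; again the class, helper method and data are illustrative assumptions, with `getCDF` being the datasketches-java call mentioned in the description (boundary inclusion at the split points depends on the library version's default search criteria).

```java
import org.apache.datasketches.kll.KllFloatsSketch;

// Hypothetical example class, not Hive code: estimates the selectivity of a
// range predicate on a column from the CDF exposed by a KLL sketch.
public class CdfSelectivityExample {

  /** Approximate fraction of rows with lo < col <= hi (up to boundary semantics). */
  static double rangeSelectivity(KllFloatsSketch sketch, float lo, float hi) {
    // getCDF returns the cumulative rank at each split point plus a trailing 1.0,
    // so the mass between lo and hi is the difference of the first two entries.
    double[] cdf = sketch.getCDF(new float[] {lo, hi});
    return cdf[1] - cdf[0];
  }

  public static void main(String[] args) {
    KllFloatsSketch sketch = KllFloatsSketch.newHeapInstance();
    // Heavily skewed column: 90% of the values fall in 1..9, 10% are outliers.
    for (int i = 0; i < 1_000_000; i++) {
      sketch.update(i % 10 == 0 ? 500f : i % 10);
    }

    double hardCoded = 1.0 / 3.0;                       // current heuristic
    double fromCdf = rangeSelectivity(sketch, 0f, 9f);  // histogram-based estimate

    System.out.printf("hard-coded selectivity: %.2f, CDF-based estimate: %.2f%n",
        hardCoded, fromCdf);
  }
}
```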