[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=832760&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-832760 ]
ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/Dec/22 14:32
            Start Date: 12/Dec/22 14:32
    Worklog Time Spent: 10m
      Work Description: asolimando commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1045899791


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java:
##########
@@ -1234,17 +1285,70 @@ private long evaluateComparator(Statistics stats, AnnotateStatsProcCtx aspCtx, E
                 // new estimate for the number of rows
                 return Math.round(
                     ((maxValue.subtract(value)).divide(maxValue.subtract(minValue), RoundingMode.UP))
-                        .multiply(BigDecimal.valueOf(numRows))
+                        .multiply(BigDecimal.valueOf(currNumRows))
                         .doubleValue());
               }
             }
           }
         } catch (NumberFormatException nfe) {
-          return numRows / 3;
+          return currNumRows / 3;
         }
       }
       // default
-      return numRows / 3;
+      return currNumRows / 3;
+    }
+
+    private long evaluateComparatorWithHistogram(ColStatistics cs, long currNumRows, String colTypeLowerCase,
+        String boundValue, boolean upperBound, boolean closedBound) {
+      final KllFloatsSketch kll = KllFloatsSketch.heapify(Memory.wrap(cs.getHistogram()));
+
+      if (kll.getN() == 0) {
+        return 0;
+      }
+
+      try {
+        final float value = extractFloatFromLiteralValue(colTypeLowerCase, boundValue);
+
+        // kll ignores null values (i.e., kll.getN() + numNulls = currNumRows), we therefore need to use kll.getN()
+        // instead of currNumRows since the CDF is expressed as a fraction of kll.getN(), not currNumRows
+        if (upperBound) {
+          return Math.round(kll.getN() * (closedBound ?
+              lessThanOrEqualSelectivity(kll, value) : lessThanSelectivity(kll, value)));
+        } else {
+          return Math.round(kll.getN() * (closedBound ?
+              greaterThanOrEqualSelectivity(kll, value) : greaterThanSelectivity(kll, value)));
+        }
+      } catch (RuntimeException e) {
+        LOG.debug("Selectivity computation using histogram failed to parse the boundary value ({}), "
+            + ", using the generic computation strategy", boundValue, e);
+        return currNumRows / 3;
+      }
+    }
+
+    @VisibleForTesting
+    protected static float extractFloatFromLiteralValue(String colTypeLowerCase, String value) {
+      if (colTypeLowerCase.equals(serdeConstants.TINYINT_TYPE_NAME)) {
+        return Byte.parseByte(value);
+      } else if (colTypeLowerCase.equals(serdeConstants.SMALLINT_TYPE_NAME)) {
+        return Short.parseShort(value);
+      } else if (colTypeLowerCase.equals(serdeConstants.INT_TYPE_NAME)) {
+        return Integer.parseInt(value);
+      } else if (colTypeLowerCase.equals(serdeConstants.BIGINT_TYPE_NAME)) {
+        return Long.parseLong(value);
+      } else if (colTypeLowerCase.equals(serdeConstants.FLOAT_TYPE_NAME)) {
+        return Float.parseFloat(value);
+      } else if (colTypeLowerCase.equals(serdeConstants.DOUBLE_TYPE_NAME)) {
+        return (float) Double.parseDouble(value);
+      } else if (colTypeLowerCase.startsWith(serdeConstants.DECIMAL_TYPE_NAME)) {
+        return new BigDecimal(value).floatValue();

Review Comment:
   Unfortunately, KLL is based on `float` values both for storing (`kll.update()`) and for querying (including `kll.cdf()`), so for data types with a higher capacity than `float`, such as `double` or `long`, we have to accept some loss of precision. I'd rather make this explicit here than carry around higher-precision information that will need to be cast to `float` later on anyway.

   For the same reason, I have named the method `extractFloatFromLiteralValue`, to state clearly that we are extracting a `float`, no matter the data type of the literal.

   This is unfortunate, but we are dealing with statistics, and the approximate result coming from KLL will hopefully be an improvement over the existing hard-coded estimate anyway.

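To make the precision trade-off in the review comment concrete, here is a minimal, self-contained sketch of what happens when `bigint` and `double` literals are narrowed to `float`. The class and helper names (`FloatPrecisionDemo`, `bigintLiteralToFloat`, `doubleLiteralToFloat`) are hypothetical and only mirror the corresponding branches of `extractFloatFromLiteralValue`; this is not code from the PR.

```java
// Hypothetical demo class, not part of Hive: it only illustrates the
// precision loss that the review comment accepts.
final class FloatPrecisionDemo {

  // Mirrors the BIGINT branch: parse as long, then implicitly widen to float.
  static float bigintLiteralToFloat(String literal) {
    return Long.parseLong(literal);
  }

  // Mirrors the DOUBLE branch: parse as double, then explicitly narrow to float.
  static float doubleLiteralToFloat(String literal) {
    return (float) Double.parseDouble(literal);
  }

  public static void main(String[] args) {
    // A float has a 24-bit significand, so 2^24 + 1 cannot be represented
    // and collapses onto 2^24.
    System.out.println(
        bigintLiteralToFloat("16777217") == bigintLiteralToFloat("16777216")); // true

    // Doubles beyond the float range become infinite when narrowed.
    System.out.println(Float.isInfinite(doubleLiteralToFloat("1e40"))); // true

    // Narrowing and widening again does not give back the original double.
    System.out.println((double) doubleLiteralToFloat("0.1") == 0.1); // false
  }
}
```

Distinct literals that map to the same `float` yield the same estimate from a float-based sketch, which is the loss the comment describes as acceptable for statistics.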
Issue Time Tracking
-------------------

    Worklog Id:     (was: 832760)
    Time Spent: 9h 50m  (was: 9h 40m)

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 9h 50m
>  Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a
> hard-coded value of 1/3 (see
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
> * efficiency: the approach must scale and support billions of rows
> * merge-ability: partition-level histograms have to be merged to form
> table-level histograms
> * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates [KLL data
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAFs.
> Datasketches are small, stateful programs that process massive data streams
> and can provide approximate answers, with mathematical guarantees, to
> computationally difficult queries orders of magnitude faster than
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric
> families) and temporal data types (date and timestamp).
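
As a rough illustration of how a CDF turns a range predicate into a row estimate, the sketch below replaces the KLL sketch with an exact empirical CDF over a sorted array of non-null values. This is an assumption made to keep the example self-contained: the class `EmpiricalCdfEstimator` and its methods are hypothetical, and the actual selectivity helpers used in the PR are not shown in the diff. Only the shape of the computation (selectivity from the CDF, multiplied by the number of non-null values) follows `evaluateComparatorWithHistogram`.

```java
import java.util.Arrays;

// Hypothetical stand-in for a KLL sketch: an exact empirical CDF over the
// non-null values of a column. A real sketch would answer the same rank
// queries approximately, in bounded memory.
final class EmpiricalCdfEstimator {
  private final float[] sorted;

  EmpiricalCdfEstimator(float[] nonNullValues) {
    sorted = nonNullValues.clone();
    Arrays.sort(sorted);
  }

  // Counts the values below v: strictly below, or below-or-equal when inclusive.
  private int rank(float v, boolean inclusive) {
    int lo = 0;
    int hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      boolean goRight = inclusive ? sorted[mid] <= v : sorted[mid] < v;
      if (goRight) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Estimated rows matching col < v, col <= v, col > v or col >= v.
  // As in the diff, the selectivity is a fraction of the non-null count n.
  long estimateRows(float bound, boolean upperBound, boolean closedBound) {
    int n = sorted.length;
    if (n == 0) {
      return 0;
    }
    double selectivity = upperBound
        ? (double) rank(bound, closedBound) / n          // col < v or col <= v
        : 1.0 - (double) rank(bound, !closedBound) / n;  // col >= v or col > v
    return Math.round(n * selectivity);
  }

  public static void main(String[] args) {
    EmpiricalCdfEstimator est =
        new EmpiricalCdfEstimator(new float[] {1f, 2f, 2f, 3f, 10f});
    System.out.println(est.estimateRows(2f, true, false));  // col < 2
    System.out.println(est.estimateRows(2f, true, true));   // col <= 2
    System.out.println(est.estimateRows(2f, false, true));  // col >= 2
    System.out.println(est.estimateRows(2f, false, false)); // col > 2
  }
}
```

With the hard-coded strategy every one of these predicates would be estimated at a third of the rows, regardless of where the bound falls in the data; a CDF-based estimate follows the actual distribution, which matters most for skewed columns.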