[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=832837&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-832837 ]
ASF GitHub Bot logged work on HIVE-26221: ----------------------------------------- Author: ASF GitHub Bot Created on: 12/Dec/22 17:58 Start Date: 12/Dec/22 17:58 Worklog Time Spent: 10m Work Description: amansinha100 commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1046190045 ########## ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java: ########## @@ -1234,17 +1285,70 @@ private long evaluateComparator(Statistics stats, AnnotateStatsProcCtx aspCtx, E // new estimate for the number of rows return Math.round( ((maxValue.subtract(value)).divide(maxValue.subtract(minValue), RoundingMode.UP)) - .multiply(BigDecimal.valueOf(numRows)) + .multiply(BigDecimal.valueOf(currNumRows)) .doubleValue()); } } } } catch (NumberFormatException nfe) { - return numRows / 3; + return currNumRows / 3; } } // default - return numRows / 3; + return currNumRows / 3; + } + + private long evaluateComparatorWithHistogram(ColStatistics cs, long currNumRows, String colTypeLowerCase, + String boundValue, boolean upperBound, boolean closedBound) { + final KllFloatsSketch kll = KllFloatsSketch.heapify(Memory.wrap(cs.getHistogram())); + + if (kll.getN() == 0) { + return 0; + } + + try { + final float value = extractFloatFromLiteralValue(colTypeLowerCase, boundValue); + + // kll ignores null values (i.e., kll.getN() + numNulls = currNumRows), we therefore need to use kll.getN() + // instead of currNumRows since the CDF is expressed as a fraction of kll.getN(), not currNumRows + if (upperBound) { + return Math.round(kll.getN() * (closedBound ? + lessThanOrEqualSelectivity(kll, value) : lessThanSelectivity(kll, value))); + } else { + return Math.round(kll.getN() * (closedBound ? + greaterThanOrEqualSelectivity(kll, value) : greaterThanSelectivity(kll, value))); + } + } catch (RuntimeException e) { + LOG.debug("Selectivity computation using histogram failed to parse the boundary value ({}), " + + ", using the generic computation strategy", boundValue, e); + return currNumRows / 3; + } + } + + @VisibleForTesting + protected static float extractFloatFromLiteralValue(String colTypeLowerCase, String value) { + if (colTypeLowerCase.equals(serdeConstants.TINYINT_TYPE_NAME)) { + return Byte.parseByte(value); + } else if (colTypeLowerCase.equals(serdeConstants.SMALLINT_TYPE_NAME)) { + return Short.parseShort(value); + } else if (colTypeLowerCase.equals(serdeConstants.INT_TYPE_NAME)) { + return Integer.parseInt(value); + } else if (colTypeLowerCase.equals(serdeConstants.BIGINT_TYPE_NAME)) { + return Long.parseLong(value); + } else if (colTypeLowerCase.equals(serdeConstants.FLOAT_TYPE_NAME)) { + return Float.parseFloat(value); + } else if (colTypeLowerCase.equals(serdeConstants.DOUBLE_TYPE_NAME)) { + return (float) Double.parseDouble(value); + } else if (colTypeLowerCase.startsWith(serdeConstants.DECIMAL_TYPE_NAME)) { + return new BigDecimal(value).floatValue(); Review Comment: Ah, that sounds fine then. I see the rationale for naming it extractFloat.. Issue Time Tracking ------------------- Worklog Id: (was: 832837) Time Spent: 10.5h (was: 10h 20m) > Add histogram-based column statistics > ------------------------------------- > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics > Affects Versions: 4.0.0-alpha-2 > Reporter: Alessandro Solimando > Assignee: Alessandro Solimando > Priority: Major > Labels: pull-request-available > Time Spent: 10.5h > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. > Hive's current selectivity estimation for range predicates is based on a > hard-coded value of 1/3 (see > [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).]) > The current proposal aims at integrating histogram as an additional column > statistics, stored into the Hive metastore at the table (or partition) level. > The main requirements for histogram integration are the following: > * efficiency: the approach must scale and support billions of rows > * merge-ability: partition-level histograms have to be merged to form > table-level histograms > * explicit and configurable trade-off between memory footprint and accuracy > Hive already integrates [KLL data > sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. > Datasketches are small, stateful programs that process massive data-streams > and can provide approximate answers, with mathematical guarantees, to > computationally difficult queries orders-of-magnitude faster than > traditional, exact methods. > We propose to use KLL, and more specifically the cumulative distribution > function (CDF), as the underlying data structure for our histogram statistics. > The current proposal targets numeric data types (float, integer and numeric > families) and temporal data types (date and timestamp). -- This message was sent by Atlassian Jira (v8.20.10#820010)