[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=832101&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-832101 ]
ASF GitHub Bot logged work on HIVE-26221:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Dec/22 14:56
            Start Date: 08/Dec/22 14:56
    Worklog Time Spent: 10m
      Work Description: asolimando commented on code in PR #3137:
URL: https://github.com/apache/hive/pull/3137#discussion_r1043447078


##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/stats/annotation/StatsRulesProcFactory.java:
##########
@@ -834,6 +844,36 @@ private long evaluateBetweenExpr(Statistics stats, ExprNodeDesc pred, long currN
         return currNumRows;
       }

+      try {
+        if (comparisonExpression instanceof ExprNodeColumnDesc) {
+          final ExprNodeColumnDesc columnDesc = (ExprNodeColumnDesc) comparisonExpression;
+          ColStatistics cs = stats.getColumnStatisticsFromColName(columnDesc.getColumn());
+          if (FilterSelectivityEstimator.isHistogramAvailable(cs)) {
+            final KllFloatsSketch kll = KllFloatsSketch.heapify(Memory.wrap(cs.getHistogram()));
+            final String colTypeLowerCase = columnDesc.getTypeString().toLowerCase();
+            final String leftValueString = leftExpression instanceof ExprNodeConstantDesc
+                ? ((ExprNodeConstantDesc) leftExpression).getValue().toString() : leftExpression.getExprString();
+            final String rightValueString = rightExpression instanceof ExprNodeConstantDesc
+                ? ((ExprNodeConstantDesc) rightExpression).getValue().toString() : rightExpression.getExprString();
+            final float leftValue = extractFloatFromLiteralValue(colTypeLowerCase, leftValueString);
+            final float rightValue = extractFloatFromLiteralValue(colTypeLowerCase, rightValueString);
+            if (invert) {
+              // column < leftValue OR column > rightValue
+              if (rightValue < leftValue) {
+                return kll.getN();
+              }
+              return Math.round(kll.getN() * (lessThanSelectivity(kll, leftValue) + greaterThanSelectivity(kll, rightValue)));
+            }
+            // if they are equal we can't handle it here, it becomes an equality predicate
+            if (Float.compare(leftValue, rightValue) != 0) {
+              return Math.round(kll.getN() * FilterSelectivityEstimator.betweenSelectivity(kll, leftValue, rightValue));
+            }
+          }
+        }
+      } catch(IllegalArgumentException e) {

Review Comment:
   In theory, `Timestamp.valueOf()` and `Date.valueOf()` only throw `IllegalArgumentException`, while all the numeric data types throw `NumberFormatException`, which extends `IllegalArgumentException`.

   However, `Float.parseFloat()`, `Double.parseDouble()`, `new BigDecimal()`, `Timestamp.valueOf()` and `Date.valueOf()` throw a `NullPointerException` when the input string is `null`; we can check for that and throw an `IllegalArgumentException` instead.

   Since we can always fall back to the standard computation as a safety net, I have nothing against catching a more general exception, but in that case I think `catch (RuntimeException e) {...}` is the better choice, so that we don't catch and ignore things like `InterruptedException`, which would be pretty bad. Anyway, any exception thrown here that is not declared in the method signature must inherit from `RuntimeException`, so I think it's safe.

   I will update the unit tests to cover this case too.


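To make the exception discussion in the review comment concrete, here is a minimal sketch of the "null check plus `catch (RuntimeException e)`" idea. It is not the PR's code: `LiteralFallbackSketch`, `toFloat` and `canUseHistogram` are hypothetical names, and `toFloat` is only an assumed stand-in for `extractFloatFromLiteralValue`, whose body is not shown in the diff. Only numeric types are covered here.

```java
import java.math.BigDecimal;

class LiteralFallbackSketch {

  // Hypothetical stand-in for the PR's extractFloatFromLiteralValue; the real
  // method's behavior is not visible in the diff, so this is an assumption.
  static float toFloat(String typeLowerCase, String literal) {
    if (literal == null) {
      // The parsers below would raise NullPointerException on null input;
      // convert that into the exception type the caller already expects.
      throw new IllegalArgumentException("null literal for type " + typeLowerCase);
    }
    switch (typeLowerCase) {
      case "float":
      case "double":
        // Throws NumberFormatException (an IllegalArgumentException) if malformed.
        return Float.parseFloat(literal);
      case "decimal":
        // Also throws NumberFormatException if malformed.
        return new BigDecimal(literal).floatValue();
      default:
        throw new IllegalArgumentException("unsupported type: " + typeLowerCase);
    }
  }

  // Returns true when both bounds parse, i.e. the histogram path could be used;
  // false signals that the caller should fall back to the standard computation.
  static boolean canUseHistogram(String typeLowerCase, String left, String right) {
    try {
      toFloat(typeLowerCase, left);
      toFloat(typeLowerCase, right);
      return true;
    } catch (RuntimeException e) {
      // Covers IllegalArgumentException, NumberFormatException and any stray
      // NullPointerException, while leaving checked exceptions untouched.
      return false;
    }
  }
}
```

With this shape, a malformed or `null` bound simply disables the histogram path instead of failing the query, which is the safety net the comment describes.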
Issue Time Tracking
-------------------

    Worklog Id:     (was: 832101)
    Time Spent: 7h  (was: 6h 50m)

> Add histogram-based column statistics
> -------------------------------------
>
>                 Key: HIVE-26221
>                 URL: https://issues.apache.org/jira/browse/HIVE-26221
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO, Metastore, Statistics
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 7h
>  Remaining Estimate: 0h
>
> Hive does not support histogram statistics, which are particularly useful for
> skewed data (which is very common in practice) and range predicates.
> Hive's current selectivity estimation for range predicates is based on a
> hard-coded value of 1/3 (see
> [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).
> The current proposal aims at integrating histograms as an additional column
> statistic, stored in the Hive metastore at the table (or partition) level.
> The main requirements for histogram integration are the following:
> * efficiency: the approach must scale and support billions of rows
> * merge-ability: partition-level histograms have to be merged to form
> table-level histograms
> * explicit and configurable trade-off between memory footprint and accuracy
> Hive already integrates [KLL data
> sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF.
> Datasketches are small, stateful programs that process massive data-streams
> and can provide approximate answers, with mathematical guarantees, to
> computationally difficult queries orders-of-magnitude faster than
> traditional, exact methods.
> We propose to use KLL, and more specifically the cumulative distribution
> function (CDF), as the underlying data structure for our histogram statistics.
> The current proposal targets numeric data types (float, integer and numeric
> families) and temporal data types (date and timestamp).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
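The idea of estimating range-predicate selectivity from a CDF, as proposed in the issue description, can be illustrated with a small sketch. This is an assumption-laden toy: it computes an exact empirical CDF over an in-memory array rather than using a KLL sketch (which would give an approximate CDF in bounded memory), and `CdfSelectivitySketch` and its methods are hypothetical names, not Hive or DataSketches APIs.

```java
class CdfSelectivitySketch {

  private final float[] values;

  CdfSelectivitySketch(float[] values) {
    this.values = values.clone();
  }

  // Fraction of values below x; when inclusive is true, values equal to x count too.
  private double cdf(float x, boolean inclusive) {
    if (values.length == 0) {
      return 0.0;
    }
    int count = 0;
    for (float v : values) {
      if (v < x || (inclusive && v == x)) {
        count++;
      }
    }
    return (double) count / values.length;
  }

  // Selectivity of "column BETWEEN low AND high" (both bounds inclusive):
  // CDF(high, inclusive) minus CDF(low, exclusive).
  double betweenSelectivity(float low, float high) {
    if (high < low) {
      return 0.0; // empty range matches nothing
    }
    return cdf(high, true) - cdf(low, false);
  }

  // Estimated number of rows passing the BETWEEN filter.
  long estimateRows(float low, float high) {
    return Math.round(values.length * betweenSelectivity(low, high));
  }
}
```

For the skewed sample `{1, 2, 2, 3, 10, 50, 100, 1000}`, `betweenSelectivity(2f, 10f)` gives 0.5 and `betweenSelectivity(0f, 100f)` gives 0.875, whereas a fixed 1/3 heuristic would return the same estimate for both ranges.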