[ https://issues.apache.org/jira/browse/HIVE-26221?focusedWorklogId=832215&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-832215 ]
ASF GitHub Bot logged work on HIVE-26221: ----------------------------------------- Author: ASF GitHub Bot Created on: 09/Dec/22 02:19 Start Date: 09/Dec/22 02:19 Worklog Time Spent: 10m Work Description: dengzhhu653 commented on code in PR #3137: URL: https://github.com/apache/hive/pull/3137#discussion_r1044002365 ########## ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java: ########## @@ -167,6 +178,109 @@ public Double visitCall(RexCall call) { return selectivity; } + private double computeRangePredicateSelectivity(RexCall call, SqlKind op) { + final boolean isLiteralLeft = call.getOperands().get(0).getKind().equals(SqlKind.LITERAL); + final boolean isLiteralRight = call.getOperands().get(1).getKind().equals(SqlKind.LITERAL); + final boolean isInputRefLeft = call.getOperands().get(0).getKind().equals(SqlKind.INPUT_REF); + final boolean isInputRefRight = call.getOperands().get(1).getKind().equals(SqlKind.INPUT_REF); + + if (childRel instanceof HiveTableScan && isLiteralLeft != isLiteralRight && isInputRefLeft != isInputRefRight) { + final HiveTableScan t = (HiveTableScan) childRel; + final int inputRefIndex = ((RexInputRef) call.getOperands().get(isInputRefLeft ? 0 : 1)).getIndex(); + final List<ColStatistics> colStats = t.getColStat(Collections.singletonList(inputRefIndex)); + + if (!colStats.isEmpty() && isHistogramAvailable(colStats.get(0))) { + final KllFloatsSketch kll = KllFloatsSketch.heapify(Memory.wrap(colStats.get(0).getHistogram())); + final Object boundValueObject = ((RexLiteral) call.getOperands().get(isLiteralLeft ? 0 : 1)).getValue(); + final SqlTypeName typeName = call.getOperands().get(isInputRefLeft ? 0 : 1).getType().getSqlTypeName(); + float value = extractLiteral(typeName, boundValueObject); + boolean closedBound = op.equals(SqlKind.LESS_THAN_OR_EQUAL) || op.equals(SqlKind.GREATER_THAN_OR_EQUAL); + + double selectivity; + if (op.equals(SqlKind.LESS_THAN_OR_EQUAL) || op.equals(SqlKind.LESS_THAN)) { + selectivity = closedBound ? lessThanOrEqualSelectivity(kll, value) : lessThanSelectivity(kll, value); + } else { + selectivity = closedBound ? greaterThanOrEqualSelectivity(kll, value) : greaterThanSelectivity(kll, value); + } + + // selectivity does not account for null values, we multiply for the number of non-null values (getN) and we + // divide by the total (non-null + null values) to get the overall selectivity + return kll.getN() * selectivity / t.getTable().getRowCount(); Review Comment: I mean for a filter a < 3, if in table there are some rows for column a: 1 null null 3 4 after filter the are only one row left, so the filter selectivity is 1/5, not (1+2(null))/5. Issue Time Tracking ------------------- Worklog Id: (was: 832215) Time Spent: 7.5h (was: 7h 20m) > Add histogram-based column statistics > ------------------------------------- > > Key: HIVE-26221 > URL: https://issues.apache.org/jira/browse/HIVE-26221 > Project: Hive > Issue Type: Improvement > Components: CBO, Metastore, Statistics > Affects Versions: 4.0.0-alpha-2 > Reporter: Alessandro Solimando > Assignee: Alessandro Solimando > Priority: Major > Labels: pull-request-available > Time Spent: 7.5h > Remaining Estimate: 0h > > Hive does not support histogram statistics, which are particularly useful for > skewed data (which is very common in practice) and range predicates. > Hive's current selectivity estimation for range predicates is based on a > hard-coded value of 1/3 (see > [FilterSelectivityEstimator.java#L138-L144|https://github.com/apache/hive/blob/56c336268ea8c281d23c22d89271af37cb7e2572/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L138-L144]).]) > The current proposal aims at integrating histogram as an additional column > statistics, stored into the Hive metastore at the table (or partition) level. > The main requirements for histogram integration are the following: > * efficiency: the approach must scale and support billions of rows > * merge-ability: partition-level histograms have to be merged to form > table-level histograms > * explicit and configurable trade-off between memory footprint and accuracy > Hive already integrates [KLL data > sketches|https://datasketches.apache.org/docs/KLL/KLLSketch.html] UDAF. > Datasketches are small, stateful programs that process massive data-streams > and can provide approximate answers, with mathematical guarantees, to > computationally difficult queries orders-of-magnitude faster than > traditional, exact methods. > We propose to use KLL, and more specifically the cumulative distribution > function (CDF), as the underlying data structure for our histogram statistics. > The current proposal targets numeric data types (float, integer and numeric > families) and temporal data types (date and timestamp). -- This message was sent by Atlassian Jira (v8.20.10#820010)