Stamatis Zampetakis created HIVE-29479:
------------------------------------------

             Summary: Improve histogram-based selectivity estimation for 
two-sided range predicates
                 Key: HIVE-29479
                 URL: https://issues.apache.org/jira/browse/HIVE-29479
             Project: Hive
          Issue Type: Improvement
          Components: CBO
            Reporter: Stamatis Zampetakis


HIVE-26221 introduced histogram statistics to improve selectivity estimation 
for range predicates.

The current estimation logic works reasonably well for:
 * Single-sided range predicates (e.g., {{age > 30}})
 * Closed bounded ranges (e.g., {{age BETWEEN 30 AND 40}})

However, selectivity estimation for general two-sided range predicates (i.e., 
predicates with both lower and upper bounds) can still be significantly 
improved.

Consider the following examples:
 * {{(30, 40]}} → {{age > 30 AND age <= 40}}
 * {{(30, 40)}} → {{age > 30 AND age < 40}}
 * {{[30, 40)}} → {{age >= 30 AND age < 40}}

Currently, the selectivity of such predicates is computed by multiplying the 
selectivity of each conjunct independently.

For example, assume a dataset containing integer ages from 1 to 100 (inclusive).

For the predicate {{(30, 40]}}:
{noformat}
sel(age > 30) = 70 / 100 = 0.7
sel(age <= 40) = 40 / 100 = 0.4
Estimated selectivity = 0.7 * 0.4 = 0.28
{noformat}
However, the actual selectivity is: 10 / 100 = 0.1

The current approach implicitly assumes independence between the lower and 
upper bound predicates, which leads to significant overestimation for narrow 
ranges.

Since histogram statistics are already available, the correct selectivity can 
be derived directly from the cumulative distribution defined by the histogram 
bucket boundaries, instead of multiplying the two sides independently.

*Proposed Improvement:*

Enhance 
[FilterSelectivityEstimator|https://github.com/apache/hive/blob/fb5e8703d915ec045a3c0148a765c9554b22bfe9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L111]
 to:
 * Detect two-sided range predicates.
 * Compute selectivity using histogram cumulative distribution rather than 
independent multiplication.
 * Correctly handle all combinations of inclusive and exclusive bounds.

Most of the required logic is already present. The improvement should primarily 
involve refactoring and generalizing the existing range-handling code.

The {{SEARCH}} operator internally represents ranges and can be leveraged to 
make the implementation more generic and robust.

As part of this change, the current {{SEARCH}} expansion performed by 
{{SearchTransformer}} inside 
[FilterSelectivityEstimator|https://github.com/apache/hive/blob/fb5e8703d915ec045a3c0148a765c9554b22bfe9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L111]
 will likely need to be removed or reworked to preserve range semantics during 
estimation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to