[ 
https://issues.apache.org/jira/browse/HIVE-29479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062389#comment-18062389
 ] 

Stamatis Zampetakis commented on HIVE-29479:
--------------------------------------------

Exactly, the estimator should not bother with the creation of the SEARCH but if 
the latter is present already we should take advantage of it instead of 
expanding/transforming to something else that leads to inferior estimates.

> Improve histogram-based selectivity estimation for two-sided range predicates
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-29479
>                 URL: https://issues.apache.org/jira/browse/HIVE-29479
>             Project: Hive
>          Issue Type: Improvement
>          Components: CBO
>            Reporter: Stamatis Zampetakis
>            Priority: Major
>
> HIVE-26221 introduced histogram statistics to improve selectivity estimation 
> for range predicates.
> The current estimation logic works reasonably well for:
>  * Single-sided range predicates (e.g., {{age > 30}})
>  * Closed bounded ranges (e.g., {{age BETWEEN 30 AND 40}})
> However, selectivity estimation for general two-sided range predicates (i.e., 
> predicates with both lower and upper bounds) can still be significantly 
> improved.
> Consider the following examples:
>  * {{(30, 40]}} → {{age > 30 AND age <= 40}}
>  * {{(30, 40)}} → {{age > 30 AND age < 40}}
>  * {{[30, 40)}} → {{age >= 30 AND age < 40}}
> Currently, the selectivity of such predicates is computed by multiplying the 
> selectivity of each conjunct independently.
> For example, assume a dataset containing integer ages from 1 to 100 
> (inclusive).
> For the predicate {{(30, 40]}}:
> {noformat}
> sel(age > 30) = 70 / 100 = 0.7
> sel(age <= 40) = 40 / 100 = 0.4
> Estimated selectivity = 0.7 * 0.4 = 0.28
> {noformat}
> However, the actual selectivity is: 10 / 100 = 0.1
> The current approach implicitly assumes independence between the lower and 
> upper bound predicates, which leads to significant overestimation for narrow 
> ranges.
> Since histogram statistics are already available, the correct selectivity can 
> be derived directly from the cumulative distribution defined by the histogram 
> bucket boundaries, instead of multiplying the two sides independently.
> *Proposed Improvement:*
> Enhance 
> [FilterSelectivityEstimator|https://github.com/apache/hive/blob/fb5e8703d915ec045a3c0148a765c9554b22bfe9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L111]
>  to:
>  * Detect two-sided range predicates.
>  * Compute selectivity using histogram cumulative distribution rather than 
> independent multiplication.
>  * Correctly handle all combinations of inclusive and exclusive bounds.
> Most of the required logic is already present. The improvement should 
> primarily involve refactoring and generalizing the existing range-handling 
> code.
> The {{SEARCH}} operator internally represents ranges and can be leveraged to 
> make the implementation more generic and robust.
> As part of this change, the current {{SEARCH}} expansion performed by 
> {{SearchTransformer}} inside 
> [FilterSelectivityEstimator|https://github.com/apache/hive/blob/fb5e8703d915ec045a3c0148a765c9554b22bfe9/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/stats/FilterSelectivityEstimator.java#L111]
>  will likely need to be removed or reworked to preserve range semantics 
> during estimation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to